THE ANNALS OF STATISTICS


INSTITUTE OF MATHEMATICAL STATISTICS 


Models in the Physical Sciences
High-resolution asymptotics for the angular bispectrum of spherical random fields
    DOMENICO MARINUCCI  1
Extended statistical modeling under symmetry; the link toward quantum mechanics
    INGE S. HELLAND  42

Parametric Inference
Improved minimax predictive densities under Kullback–Leibler loss
    EDWARD I. GEORGE, FENG LIANG AND XINYI XU  78
Sequential change-point detection when unknown parameters are present in the pre-change distribution
    YAJUN MEI  92

Spatial Statistics
Consistent estimation of the basic neighborhood of Markov random fields
    IMRE CSISZÁR AND ZSOLT TALATA  123
Spatial extremes: Models for the stationary case
    LAURENS DE HAAN AND TERESA T. PEREIRA  146

Nonparametric and Semiparametric Inference
Penalized maximum likelihood and semiparametric second-order efficiency
    A. S. DALALYAN, G. K. GOLUBEV AND A. B. TSYBAKOV  169
Adaptive confidence balls
    T. TONY CAI AND MARK G. LOW  202
Adaptive nonparametric confidence sets
    JAMES ROBINS AND AAD VAN DER VAART  229
Serial and nonserial sign-and-rank statistics: Asymptotic representation and asymptotic normality
    MARC HALLIN, CATHERINE VERMANDELE AND BAS WERKER  254
Local partial-likelihood estimation for lifetime data
    JIANQING FAN, HUAZHEN LIN AND YONG ZHOU  290
Adaptive multiscale detection of filamentary structures in a background of uniform random points
    ERY ARIAS-CASTRO, DAVID L. DONOHO AND XIAOMING HUO  326
Optimal change-point estimation from indirect observations
    A. GOLDENSHLUGER, A. TSYBAKOV AND A. ZEEVI  350

Multiple Testing Problems
Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses
    NICOLAI MEINSHAUSEN AND JOHN RICE  373
False discovery and false nondiscovery rates in single-step multiple testing procedures
    SANAT K. SARKAR  394

Bayesian Inference
Poisson calculus for spatial neutral to the right processes
    LANCELOT F. JAMES  416
Nonsubjective priors via predictive relative entropy regret
    TREVOR J. SWEETING, GAURI S. DATTA AND MALAY GHOSH  441

Asymptotics
Asymptotic normality of extreme value estimators on C[0, 1]
    JOHN H. J. EINMAHL AND TAO LIN  469
Stable limits of martingale transforms with application to the estimation of GARCH parameters
    THOMAS MIKOSCH AND DANIEL STRAUMANN  493

Importance Sampling and Design
Sequential importance sampling for multiway tables
    YUGUO CHEN, IAN H. DINWOODIE AND SETH SULLIVANT  523
Doubling and projection: A method of constructing two-level designs of resolution IV
    HEGANG H. CHEN AND CHING-SHUI CHENG  546



INSTITUTE OF MATHEMATICAL STATISTICS 
(Organized September 12, 1935) 


The purpose of the Institute is to foster the development and dissemination of the theory and 
applications of statistics and probability. 


OFFICERS AND EDITORS 
President: Thomas G. Kurtz, Department of Mathematics, University of Wisconsin-Madison, Madison,
Wisconsin 53706-1388 
President-Elect: Jim Pitman, Department of Statistics, University of California, Berkeley, California


Past President: Louis H. Y. Chen, National University of Singapore, Institute for Mathematical Sciences, 3 Prince
George's Park, Singapore 118402

Executive Secretary: Cindy L. Christiansen, Department of Health Services, Boston University, 200 Springs
Road (152), Bedford, Massachusetts 01730

Treasurer: Jiayang Sun, Department of Statistics, Case Western Reserve University, Cleveland, Ohio
44106-7054

Program Secretary: Andrew Nobel, Department of Statistics, University of North Carolina, New West
Building, Chapel Hill, North Carolina 27599-3260

Editors, The Annals of Statistics: Morris L. Eaton, Department of Statistics, University of Minnesota, 224 Church
Street S.E., Minneapolis, Minnesota 55455; Jianqing Fan, Department of Operations Research and
Financial Engineering, Princeton University, Princeton, New Jersey 08544

Editor, The Annals of Probability: Gregory F. Lawler, Department of Mathematics, Cornell University, Malott Hall,
Ithaca, New York 14853-4201

Editor, The Annals of Applied Probability: Edward C. Waymire, Department of Mathematics, Oregon State
University, Corvallis, Oregon 97331

Editor, Statistical Science: Edward I. George, Department of Statistics, University of Pennsylvania, Philadelphia,
Pennsylvania 19104-6340

Editor, The IMS Bulletin: Bernard Silverman, St Peter's College, Oxford OX1 2DL, United Kingdom

Editor, IMS Lecture Notes–Monograph Series: Richard Vitale, Department of Statistics, University of Connecticut,
U-120, Storrs, Connecticut 06269

Editor, IMS Web Page: Hemant Ishwaran, Department of Biostatistics and Epidemiology, Cleveland Clinic
Foundation, Cleveland, Ohio 44195

Managing Editor, Statistics: Paul Shaman, Department of Statistics, University of Pennsylvania,
Philadelphia, Pennsylvania 19104-6340

Managing Editor, Probability: Michael Phelan, Department of Mathematics, Chapman University,
One University Drive, Orange, California 92866


Journals. The scientific journals of the Institute are The Annals of Applied Probability, The Annals of
Probability, The Annals of Statistics and Statistical Science. The news organ of the Institute is The
Institute of Mathematical Statistics Bulletin.


Individual and Organizational Memberships. All individual members pay basic membership dues
of $75. Each individual member must elect to receive at least one scientific journal for an additional
amount, as follows: The Annals of Applied Probability ($20), The Annals of Probability ($25), The Annals
of Statistics ($30) or Statistical Science ($15). Of the total dues paid, $29 is allocated to The IMS Bulletin
and the remaining amount is allocated equally among the scientific journal(s) received. Reduced mem-
bership dues are available to full-time students, permanent residents of countries designated by the IMS
Council and retired members. Retired members may elect to receive the Bulletin only for $26.
Organizational memberships are available to nonprofit organizations at $545 per year and to for-profit
organizations at $850 per year. Organizational memberships include two multiple-readership copies of
all IMS journals in addition to other benefits specified for each category (details available from the IMS
Business Office).


Individual and General Subscriptions. Subscriptions are available on a calendar-year basis. Individual
subscriptions are for the personal use of the subscriber and must be in the name of, paid directly by, and
mailed to an individual. Individual subscriptions for 2006 are available to The Annals of Applied Probability
($95), The Annals of Probability ($100), The Annals of Statistics ($105), The IMS Bulletin ($60)
and Statistical Science ($90). General subscriptions are for libraries, institutions and any multiple-
readership use. General subscriptions for 2006 are available to The Annals of Applied Probability ($150), The
Annals of Probability ($250), The Annals of Statistics ($250), The IMS Bulletin ($70) and Statistical Science
($140). Air mail rates for delivery outside North America are $80 per title (excluding The IMS Bulletin).

Permissions Policy. Authorization to photocopy items for internal or personal use is granted by the Institute
of Mathematical Statistics. For multiple copies or reprint permission, contact The Copyright Clearance Center,
222 Rosewood Drive, Danvers, Massachusetts 01923. Telephone (978) 750-8400. http://www.copyright.com. If
the permission is not found at the Copyright Clearance Center, please contact the IMS Office at ims@imstat.org.


Correspondence. Mail concerning membership, subscriptions, nonreceipt claims, copyright permissions, adver-
tising or back issues should be sent to the IMS Dues and Subscriptions Office, 9650 Rockville Pike, Suite L 2310,
Bethesda, Maryland 20814-3998. Mail concerning submissions or editorial content should be sent to the Editor
of the appropriate journal. Their addresses are listed above. Mail concerning the production of this journal should
be sent to: Patrick Kelly, IMS Production Editor, Department of Statistics, The Wharton School, University of
Pennsylvania, Philadelphia, Pennsylvania 19104-6340.

The Annals of Statistics (ISSN 0090-5364), Volume 34, Number 1, February 2006. Published bimonthly
by the Institute of Mathematical Statistics, 3163 Somerset Drive, Cleveland, Ohio 44122, USA. Periodicals
postage paid at Cleveland, Ohio, and at additional mailing offices.
POSTMASTER: Send address changes to The Annals of Statistics, Institute of Mathematical Statistics,
Dues and Subscriptions Office, 9650 Rockville Pike, Suite L 2310, Bethesda, Maryland 20814-3998.

Copyright © 2006 by the Institute of Mathematical Statistics

Printed in the United States of America 


EDITORS


MORRIS L. EATON JIANQING FAN 


ASSOCIATE EDITORS 


YACINE AIT-SAHALIA        VLADIMIR KOLTCHINSKII     PETER M. ROBINSON
T. TONY CAI               MICHAEL R. KOSOROK        QI-MAN SHAO
MING-YEN CHENG            JAN T. A. KOSTER          XIAOTONG SHEN
DOROTA M. DABROWSKA       JUN LIU                   VLADIMIR SPOKOINY
RAINER DAHLHAUS           ENNO MAMMEN               MICHAEL STEIN
ANIRBAN DASGUPTA          ADAM MARTINSEK            SARA VAN DE GEER
HOLGER DETTE              THOMAS MATHEW             MARK VAN DER LAAN
HOLGER DREES              RAHUL MUKERJEE            LARRY WASSERMAN
LUTZ DÜMBGEN              HANS-GEORG MÜLLER         MICHAEL WOLF
CHARLES J. GEYER          PER A. MYKLAND            QIWEI YAO
IRENE GIJBELS             DOMINIQUE PICARD          YI-CHING YAO
PETER HALL                DONALD ST. P. RICHARDS    BIN YU
XUMING HE                 CHRISTIAN P. ROBERT       RUBEN H. ZAMAR


EDITORIAL COORDINATORS 
MARY BETH FALKE


MANAGING EDITOR           PRODUCTION EDITOR


PAUL SHAMAN PATRICK KELLY 


SARAH JOHNSON SEXTON 


PAST EDITORS 
THE ANNALS OF MATHEMATICAL STATISTICS 


H. C. CARVER, 1930-1938          WILLIAM KRUSKAL, 1958-1961
S. S. WILKS, 1938-1949           J. L. HODGES, JR., 1961-1964
T. W. ANDERSON, 1950-1952        D. L. BURKHOLDER, 1964-1967
E. L. LEHMANN, 1953-1955         Z. W. BIRNBAUM, 1967-1970
T. E. HARRIS, 1955-1958          INGRAM OLKIN, 1970-1972


THE ANNALS OF STATISTICS 


INGRAM OLKIN, 1972-1973              MICHAEL WOODROOFE, 1992-1994
I. R. SAVAGE, 1974-1976              LAWRENCE D. BROWN AND
RUPERT G. MILLER, JR., 1977-1979       JOHN A. RICE, 1995-1997
DAVID V. HINKLEY, 1980-1982          JAMES O. BERGER AND
MICHAEL D. PERLMAN, 1983-1985          HANS R. KÜNSCH, 1998-2000
WILLEM R. VAN ZWET, 1986-1988        JOHN I. MARDEN AND
ARTHUR COHEN, 1989-1991                JON A. WELLNER, 2001-2003


EDITORIAL POLICY 


The Annals of Statistics aims to publish research papers of highest quality reflecting the many facets of con-
temporary statistics. Primary emphasis is placed on importance and originality, not on formalism.

The discipline of statistics has deep roots in both mathematics and in substantive scientific fields.
Mathematics provides the language in which models and the properties of statistical methods are formulated.
It is essential for rigor, coherence, clarity and understanding. Consequently, our policy is to continue to play a
special role in presenting research at the forefront of mathematical statistics, especially theoretical advances
that are likely to have a significant impact on statistical methodology or understanding. Substantive fields are
essential for continued vitality of statistics, since they provide the motivation and direction for most of the
future developments in statistics. We thus intend to also publish papers relating to the role of statistics in inter-
disciplinary investigations in all fields of natural, technical and social sciences. A third force that is reshaping
statistics is the computational revolution, and the Annals will also welcome developments in this area.
Submissions in these two latter categories will be evaluated primarily by the relevance of the issues addressed
and the creativity of the proposed solutions.

Lucidity and conciseness of presentation are important elements in the evaluation of submissions. The intro-
duction of each paper should be accessible to a wide range of readers. It should thus discuss the context and
importance of the issues addressed and give a clear, nontechnical description of the main results. In some papers
it may, for example, be appropriate to present special cases or specific examples prior to general, abstract for-
mulations, while in other papers discussion of the general scientific context of a problem might be a helpful pre-
lude to the body of the paper.


Manuscripts submitted to The Annals of Statistics should be sent to:
Morris L. (Joe) Eaton and Jianqing Fan
Editors
University of Minnesota
School of Statistics, Ford 313
224 Church St. S.E.
Minneapolis, MN 55455

or by e-mail to annals@annstat.stat.umn.edu


IMS INSTITUTIONAL MEMBERS


ACADEMIA SINICA 

ARIZONA STATE UNIVERSITY 
AUSTRALIAN NATIONAL UNIVERSITY 
BATH UNIVERSITY 


BATTELLE PACIFIC 
NORTHWEST NATIONAL LABORATORY 


BOWLING GREEN UNIVERSITY 

CARLETON UNIVERSITY 

CENTRUM VOOR WISKUNDE EN INFORMATICA
CALIFORNIA STATE UNIVERSITY, EAST BAY 


CHALMERS UNIVERSITY OF TECHNOLOGY 
& GÖTEBORG UNIVERSITY


CORNELL UNIVERSITY 

DUKE UNIVERSITY 

EINDHOVEN UNIVERSITY OF TECHNOLOGY 
FIOCRUZ–FUNDAÇÃO OSWALDO CRUZ
FLORIDA STATE UNIVERSITY 

FU JEN CATHOLIC UNIVERSITY 

HARVARD UNIVERSITY 

HIROSHIMA UNIVERSITY

INDIAN INSTITUTE OF TECHNOLOGY 
INDIAN STATISTICAL INSTITUTE 

INDIANA UNIVERSITY 

INSTITUTE FOR DEFENSE ANALYSIS 

IOWA STATE UNIVERSITY 

ISTITUTO PER LE APPLICAZIONI DEL CALCOLO 
JOHNS HOPKINS UNIVERSITY 

KANSAS STATE UNIVERSITY 


LONDON SCHOOL OF ECONOMICS 
& POLITICAL SCIENCE 


LUND UNIVERSITY 


MASSACHUSETTS INSTITUTE 
OF TECHNOLOGY 


MATHEMATICAL SCIENCES 
RESEARCH INSTITUTE 


MCGILL UNIVERSITY 

MEDICAL COLLEGE OF WISCONSIN 
MEMORIAL SLOAN KETTERING CANCER CENTER 
MICHIGAN STATE UNIVERSITY 
MINNESOTA STATE UNIVERSITY 
NANZAN UNIVERSITY 

NATIONAL SCIENCE FOUNDATION 
NATIONAL CENTRAL UNIVERSITY 
NATIONAL CHENG KUNG UNIVERSITY 
NATIONAL CHIAO TUNG UNIVERSITY 
NATIONAL SECURITY AGENCY 

NEW MEXICO STATE UNIVERSITY 
NORTH CAROLINA STATE UNIVERSITY 
NORTH DAKOTA STATE UNIVERSITY 
NORTHERN ILLINOIS UNIVERSITY 


NOTTINGHAM TRENT UNIVERSITY 
OREGON STATE UNIVERSITY 
PENNSYLVANIA STATE UNIVERSITY 

PFIZER INC 

PRINCETON UNIVERSITY 

PURDUE UNIVERSITY 

QUEENS UNIVERSITY 

RICE UNIVERSITY 

ROCKEFELLER UNIVERSITY 

RUTGERS UNIVERSITY 

SIEGEN UNIVERSITY 

SOUTHERN ILLINOIS UNIVERSITY
STOCKHOLM UNIVERSITY

TECHNISCHE UNIVERSITÄT

TEXAS A&M UNIVERSITY 

TEXAS TECH UNIVERSITY 

UNITED STATES, DEPARTMENT OF DEFENSE 
UNIVERSIDAD AUTÓNOMA DE MADRID 
UNIVERSIDADE DE COIMBRA 

UNIVERSITÀ COMMERCIALE LUIGI BOCCONI
UNIVERSITÀ DEGLI STUDI DI PADOVA
UNIVERSITÀ DEGLI STUDI DI ROMA LA SAPIENZA
UNIVERSITÄT BERN

UNIVERSITÄT KARLSRUHE

UNIVERSITÄT MÜNSTER

UNIVERSITÄT ZU LÜBECK

UNIVERSITY OF ALBERTA 

UNIVERSITY OF ARIZONA 

UNIVERSITY OF BRITISH COLUMBIA 
UNIVERSITY OF CALGARY 

UNIVERSITY OF CALIFORNIA, IRVINE 
UNIVERSITY OF CALIFORNIA, LOS ANGELES 
UNIVERSITY OF CALIFORNIA, SAN DIEGO 
UNIVERSITY OF CALIFORNIA, SANTA CRUZ 
UNIVERSITY OF CONNECTICUT 
UNIVERSITY OF DENVER 

UNIVERSITY OF EDINBURGH 

UNIVERSITY OF FLORIDA 

UNIVERSITY OF GEORGIA 

UNIVERSITY OF ILLINOIS 

UNIVERSITY OF IOWA 

UNIVERSITY OF MASSACHUSETTS 
UNIVERSITY OF MICHIGAN 

UNIVERSITY OF MINNESOTA 

UNIVERSITY OF MISSISSIPPI 

UNIVERSITY OF MISSOURI 

UNIVERSITY OF MONTREAL 

UNIVERSITY OF NEW BRUNSWICK 




INSTRUCTIONS FOR AUTHORS 


Submission of Papers. To submit an article to be con-
sidered for publication, e-mail a copy in postscript or
pdf form to annals@annstat.stat.umn.edu. All submis-
sions should be accompanied by a cover letter or e-mail
that gives the corresponding author's name, e-mail ad-
dress and regular mailing address. One of the two edi-
tors will have responsibility for the review of your paper.
You will be sent an acknowledgment that your paper
has been received. (Contact one of the editors if you
are unable to submit using postscript or pdf files.) Au-
thors are strongly encouraged to consult the Annals web
page http://www.imstat.org/aos/ for detailed instructions.
In particular, see the "Suggested Referees Option" sec-
tion before submitting a paper.


Preparation of Manuscripts. Authors should check a
recent issue of the Annals for style. Typesetting is fa-
cilitated if manuscripts are written using some form of
TeX (AMS-TeX, LaTeX, etc.). Further information can be
obtained from http://www.imstat.org/aos/. Please see the
LaTeX support page for IMS publications to use the IMS
recommended template.


Submission of Reference Papers. Unpublished or not
easily available papers cited in the manuscript should be
submitted with the manuscript.


Title. The title should be descriptive and as concise
as is feasible, that is, it should indicate the topic of the
paper as clearly as possible, but every word in it should
be pertinent.


Abbreviated Title. An abbreviated title to be used
as a running head is also required. This should
normally not exceed 35 characters. For example, an
article with the title "The Curvature of a Statistical
Model, with Applications to Large-Sample Likelihood
Methods," could have the running head, "Curvature
of Statistical Model" or possibly "Asymptotics of Like-
lihood Methods," depending on the emphasis to be con-
veyed.


Affiliation. Indicate your present institutional affiliation 
as you would like it to appear. 


Summary. Each manuscript is required to contain a
summary, clearly separated from the rest of the
paper, which will be printed immediately after the
title. Its main purpose is to inform the reader quickly
of the nature and results of the paper; it may also be
used as an aid in retrieving information. The length
of a summary will clearly depend on the length and
difficulty of the paper, but in general it should not
exceed 150 words. Formulas should be used as spar-
ingly as possible within the summary. The summary
should not make reference to results or formulas in
the body of the paper; it should be self-contained.


Footnotes. Footnotes should not be used, except as 
described under Title Page Footnotes below. Such 
information should be included within the text. 


Title Page Footnotes. Included as a footnote on page 1
should be the headings: AMS 2000 subject classifica-
tions. Primary-; secondary-. Key words and phrases.

The classification numbers representing the primary
and secondary subjects of the article may be found at
http://www.ams.org/msc/. The key words and phrases
should describe the subject matter of the article; gener-
ally they should be taken from the body of the paper.

Acknowledgment of support, grants and contracts
should also be included in this footnote.


Figures. Figures are best prepared as separate postscript
or encapsulated postscript files and should be included
with the manuscript.


Formulas. Fractions in the text are preferably written
with the solidus or negative exponent; thus, $(a + b)/(c + d)$
is preferred to $\frac{a+b}{c+d}$, and $(2x)^{-1}$ or $1/(2x)$ to $\frac{1}{2x}$.
Also, $a^{b(c)}$ and $a_{b(c)}$ are preferred to $a^{b_c}$ and $a_{b_c}$, respec-
tively. Complicated exponentials should be represented
with the symbol exp. A fractional exponent is preferable
to a radical sign.


References. Citations in text should be numbered
("... using examples shown in [1] ...") and the bibliogra-
phy should be typed double-spaced and should follow the
style:

[1] LAMPORT, L. (1994). LaTeX: A Document Prepa-
ration System, 2nd ed. Addison–Wesley, Reading, MA.

[2] KIEFER, J. C. (2002). Admissibility of conditional
confidence procedures. Ann. Statist. 30 836-865.

Abbreviations for journals should be taken from a
current index issue of Mathematical Reviews or from
http://www.ams.org/msnhtml/serials.pdf.


Copyright, Page Charges and Offprints. Page charges
are $45 per printed page. Payment of some or all
of the estimated page charges associated with articles is
strongly encouraged. The editorial review of articles and
administration of page charges are completely separate
activities. Manuscripts are reviewed and accepted prior
to determining whether page charges will be paid.

Every corresponding author will receive a pdf file via
e-mail of the article. You do not need to do anything to re-
ceive this file; it will happen automatically. Offprints may
be purchased by using the IMS Offprint Purchase Order
Form accompanying the galleys.

Copyright transfer, page charges and offprints are
the responsibility of the IMS Business Manager.


Galley Proofs. Authors will receive e-mail notification
when galleys are ready and have the option of either
downloading a pdf version of the article or having it sent
by regular mail. Similarly, authors may return corrections
either by e-mail or regular mail.


Correspondence. All correspondence with the editor
must refer to the manuscript number of the paper. This
number will be sent to the author acknowledging receipt
of the article.




The Annals of Statistics
2006, Vol. 34, No. 1, 1-41
DOI 10.1214/009053605000000903
© Institute of Mathematical Statistics, 2006


HIGH-RESOLUTION ASYMPTOTICS FOR THE ANGULAR
BISPECTRUM OF SPHERICAL RANDOM FIELDS¹


BY DOMENICO MARINUCCI 
Università di Roma “Tor Vergata” 


In this paper we study the asymptotic behavior of the angular bispectrum 
of spherical random fields. Here, the asymptotic theory is developed in the
framework of fixed-radius fields, which are observed with increasing resolu- 
tion as the sample size grows. The results we present are then exploited in a 
set of procedures aimed at testing non-Gaussianity; for these statistics, we are
able to show convergence to functionals of standard Brownian motion under
the null hypothesis. Analytic results are also presented on the behavior of the 
tests in the presence of a broad class of non-Gaussian alternatives. The issue 
of testing for non-Gaussianity on spherical random fields has recently gained 
enormous empirical importance, especially in connection with the statistical 
analysis of cosmic microwave background radiation. 


1. Introduction. Several statistical challenges are now arising in connection
with cosmological data, and more precisely, for the analysis of cosmic microwave
background radiation (CMB). CMB can be viewed as a snapshot of the Universe
approximately $3 \times 10^5$ years after the Big Bang; technological progress has made
possible a number of experiments aimed at measuring the properties of this radia-
tion. Pioneering results were released in 1992 by the NASA mission COBE [31],
which was the first to release a full-sky map of CMB fluctuations; the statistical
properties of these fluctuations were then further investigated by several balloon
experiments, starting with BOOMERanG [8] and MAXIMA [14]. A major break-
through is associated with two satellite missions, namely WMAP [4], which re-
leased its first data set in February 2003, with much more detailed data to come
in the next four years, and Planck, which is scheduled to be launched in Spring
2007 and expected to provide maps with much greater resolution. Over the next
ten years, an immense amount of cosmological information is expected from these
huge data sets; statistical efforts needed to extract this information are equally
challenging and impressive.

Received March 2004; revised March 2005.
¹Supported by MIUR.
AMS 2000 subject classifications. Primary 60G60; secondary 60F17, 62M15, 85A40.
Key words and phrases. Spherical random fields, bispectrum, Gaussianity, cosmic microwave
background radiation.

From the mathematical point of view, CMB can be represented as a random field
$T(\theta, \varphi)$ indexed by the unit sphere $S^2$, that is, for each azimuth $0 \le \theta \le \pi$ and
elongation $0 \le \varphi < 2\pi$, $T(\theta, \varphi)$ is a real random variable defined on some proba-
bility space. We shall always assume that $T(\theta, \varphi)$ is a zero-mean, finite-variance,
mean-square continuous and isotropic random field, that is, its distribution is in-
variant with respect to the group of rotations. Until very recently, the assumption
of an isotropic Universe has been taken for granted in cosmological physics, as a
consequence of Einstein's Cosmological Principle that the Universe should "ap-
pear the same" to an observer located anywhere in space. Quite interestingly, the
first release of WMAP has raised some doubts on this condition [10, 15, 27]. Test-
ing for isotropy is a very interesting topic, almost completely open for statistical
research; we consider this issue, however, beyond the purpose of the present work.

If an isotropic field is Gaussian, its dependence structure is completely identi- 
fied by the angular correlation function and its harmonic transform, the angular 
power spectrum (to be defined in the next section). For non-Gaussian fields, the 
dependence structure becomes much richer, and higher-order correlation functions 
are of interest; in turn, this leads to the analysis of so-called higher-order angu- 
lar power spectra. Because these angular power spectra are identically zero for 
Gaussian fields, they also provide natural tools to test for non-Gaussianity: this is 
a topic of greatest importance in modern cosmological data analysis. Indeed, on 
one hand the validation of the Gaussian assumption is urged by the necessity to 
provide a firm basis for statistical inference on cosmological parameters, which is 
dominated by likelihood approaches. More importantly, tests for Gaussianity are 
needed to discriminate among competing scenarios for the physics of the primor- 
dial epochs: here, the currently favored inflationary models predict (very close to) 
Gaussian CMB fluctuations, whereas other models yield different observational 
consequences [3, 28, 29]. Tests for non-Gaussianity are also powerful tools to de- 
tect systematic effects in the outcome of the experiments. For these reasons, many 
papers have focused on testing for non-Gaussianity on CMB, some of them by 
means of topological properties of Gaussian fields (e.g., [9, 13, 26, 36, 37]), others 
through spherical wavelets [2], and still others by harmonic space methods (e.g., 
[16, 20, 21, 24]). A short survey of the literature on testing for non-Gaussianity on 
CMB is in [23]. 

In this paper, we investigate the asymptotic properties for the observed bis- 
pectrum of spherical Gaussian fields, and we analyze its use as a probe of non- 
Gaussian features. The bispectrum (defined in Section 2) is probably the single 
most popular statistic to search for non-Gaussianity in CMB data; on one hand, in 
fact, working on harmonic space is extremely convenient, and the bispectrum is 
the simplest harmonic space statistic which is sensitive to non-Gaussian features. 
On the other hand, it is possible to derive analytically the behavior of the bispec- 
trum for non-Gaussian fields of physical interest. In fact, although the procedures 
considered in the present work are new (to the best of our knowledge), a number of 
insightful and important papers have already considered the bispectrum for CMB 
data analysis, for instance, [19–21, 30]. Few analytical results, however, have so far
been produced on the statistical properties of these procedures. The reason for this
can be partially explained as follows: as described in the next section, the bispec-
trum is a function of some spherical harmonic coefficients $a_{lm}$'s; in the presence
of an ideal experiment, the latter are easily derived from a map of CMB fluctu-
ations by a harmonic transform performed on the observed data. However, it is
important to stress that in realistic situations the $a_{lm}$'s are observed with error, due
to instrumental noise, gaps in the maps and many other sources (these problems
have been described in [33]). Determining the properties of the procedures in re-
alistic situations that take into account all the features of real life experiments is
extremely difficult; most work in CMB is thus based on comparisons of estimates
based on real data with the expected results of simulations under a particular null
hypothesis. In this paper, we assume that the $a_{lm}$'s are observed without error;
this is a simplifying assumption, which we adopt because it seems important to
narrow the gap between data analysis practice and its mathematical foundations,
at least in an idealized case. Future work, however, should be directed at relaxing
this assumption.

In Section 2 we define the bispectrum, taking particular care to discuss the con- 
ditions to ensure that it represents a rotationally invariant statistic. The asymptotic 
behavior of its higher-order moments is considered in Section 3: these results are 
derived under Gaussianity, but the general technique by which they are established 
(which adopts a formalism from graph theory) may have some independent in- 
terest under broader assumptions. Section 4 considers the effect of an unknown 
angular power spectrum, whereas Section 5 discusses statistical applications, with 
asymptotic results in Gaussian and non-Gaussian circumstances. The results we 
present in this section suggest that consistent tests of Gaussianity can exist even 
for random fields defined on a bounded domain, which is to some extent unex-
pected. Section 6 discusses the rationale behind the approach presented and draws
some conclusions; some technical results are collected in the Appendix. 


2. The angular bispectrum. The Fourier transform on the sphere is defined
by the spherical harmonics, which can be written explicitly as

$$Y_{lm}(\theta, \varphi) = \sqrt{\frac{2l+1}{4\pi}\,\frac{(l-m)!}{(l+m)!}}\; P_{lm}(\cos\theta)\exp(im\varphi) \quad \text{for } m \ge 0,$$
$$Y_{lm}(\theta, \varphi) = (-1)^m\, Y^{*}_{l,-m}(\theta, \varphi) \quad \text{for } m < 0,$$

where the asterisk denotes complex conjugation and $P_{lm}(\cos\theta)$ denotes the asso-
ciated Legendre polynomial of degree $l$, $m$, that is,

$$P_{lm}(x) = (-1)^m (1 - x^2)^{m/2}\,\frac{d^m}{dx^m} P_l(x), \qquad P_l(x) = \frac{1}{2^l\, l!}\,\frac{d^l}{dx^l}(x^2 - 1)^l.$$

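The definitions above can be evaluated numerically. The following is a minimal sketch in Python (standard library only), not code from the paper: the function names `assoc_legendre` and `y_lm` are illustrative, and the associated Legendre function is computed with the usual upward recurrence in $l$, which is assumed here to be an acceptable substitute for differentiating $P_l$ directly. It follows the sign convention stated in the text, with the factor $(-1)^m$ inside $P_{lm}$.

```python
import cmath
import math


def assoc_legendre(l, m, x):
    """P_lm(x) for 0 <= m <= l, including the (-1)^m factor used in the text."""
    # Start from P_mm(x) = (-1)^m (2m-1)!! (1 - x^2)^(m/2).
    pmm = 1.0
    if m > 0:
        root = math.sqrt((1.0 - x) * (1.0 + x))
        odd = 1.0
        for _ in range(m):
            pmm *= -odd * root
            odd += 2.0
    if l == m:
        return pmm
    # P_{m+1,m}(x) = x (2m+1) P_mm(x).
    pmmp1 = x * (2 * m + 1) * pmm
    if l == m + 1:
        return pmmp1
    # Upward recurrence in the degree for fixed order m.
    for ll in range(m + 2, l + 1):
        pll = (x * (2 * ll - 1) * pmmp1 - (ll + m - 1) * pmm) / (ll - m)
        pmm, pmmp1 = pmmp1, pll
    return pmmp1


def y_lm(l, m, theta, phi):
    """Spherical harmonic Y_lm(theta, phi) as defined in the text."""
    if m < 0:
        # Y_lm = (-1)^m * conj(Y_{l,-m}) for negative m.
        return (-1) ** abs(m) * y_lm(l, -m, theta, phi).conjugate()
    norm = math.sqrt(
        (2 * l + 1) / (4 * math.pi)
        * math.factorial(l - m) / math.factorial(l + m)
    )
    return norm * assoc_legendre(l, m, math.cos(theta)) * cmath.exp(1j * m * phi)
```

A quick sanity check on such an implementation is the addition theorem: for any direction, the sum of $|Y_{lm}|^2$ over $m = -l, \dots, l$ should equal $(2l+1)/(4\pi)$.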


A detailed discussion of the properties of the spherical harmonics can be found in
([34], Chapter 5), or in [35]. For isotropic fields, the following spectral represen-
tation holds in the mean-square sense (see also [1, 22, 38]):

$$T(\theta, \varphi) = \sum_{l=1}^{\infty} \sum_{m=-l}^{l} a_{lm} Y_{lm}(\theta, \varphi), \tag{1}$$

where the triangular array $\{a_{lm}\}$ represents a set of random coefficients, which can
be obtained from $T(\theta, \varphi)$ through the inversion formula

$$a_{lm} = \int_{-\pi}^{\pi}\int_{0}^{\pi} T(\theta, \varphi)\, Y^{*}_{lm}(\theta, \varphi) \sin\theta \, d\theta \, d\varphi, \qquad m = 0, \pm 1, \dots, \pm l, \quad l = 1, 2, \dots. \tag{2}$$


These coefficients are complex-valued, zero-mean and uncorrelated; hence, if
$T(\theta, \varphi)$ is Gaussian, they have a complex Gaussian distribution, and they are inde-
pendent over $l$ and $m \ge 0$ [although $a_{l,-m} = (-1)^m a^{*}_{lm}$], with variance $E|a_{lm}|^2 =
C_l$, $m = 0, \pm 1, \dots, \pm l$. The index $l$ is usually labeled a multipole and in principle
it runs from 1 to infinity; each multipole corresponds approximately to an angu-
lar resolution of $180^{\circ}/l$. In any realistic experiment, however, there is an upper
limit (which we denote by $L$) on the multipoles we may observe, depending upon
the resolution of the experiment and the presence of noise; $L$ is reckoned to be of
the order of 600/800 for WMAP and 2000/2500 for Planck. Strictly speaking, in
CMB cosmology the spectral representation (1) is really only defined for $l \ge 2$;
the so-called dipole $l = 1$ is in fact dominated by kinematic effects and thus it is
removed from the data.

The sequence $\{C_l\}$ denotes the angular power spectrum: we shall always assume that $C_l$ is strictly positive, for all values of $l$. This condition is very mild; to draw an analogy with the theory of stationary random fields defined on $\mathbb{R}^d$, it is equivalent to the (very common) assumption that their spectral density is strictly positive at all frequencies. As discussed in Section 5, the condition is met by virtually all models of cosmological interest. A natural estimator for $C_l$ is

$$\hat C_l = \frac{1}{2l+1}\sum_{m=-l}^{l}|a_{lm}|^2, \qquad l = 1,2,\ldots, \tag{3}$$

which is clearly unbiased; see also [12]. As mentioned in the Introduction, if the field is Gaussian, the angular power spectrum completely identifies its dependence structure. For non-Gaussian fields, the dependence structure becomes much richer, and higher-order moments of the $a_{lm}$'s are of interest; this leads to the analysis of so-called higher-order angular power spectra.
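As a minimal sketch of estimator (3), the following Python snippet draws the coefficients of a single multipole under Gaussianity and averages $|a_{lm}|^2$ over $m$. The function names `sample_alm` and `c_hat`, the dictionary layout keyed by $m$, and the numerical values of $l$ and $C_l$ are illustrative assumptions, not part of the original paper.

```python
import math
import random

def sample_alm(l, c_l, rng):
    """Draw {a_lm : m = -l..l} for one multipole of a real Gaussian field.

    a_l0 is real N(0, C_l); for m > 0 the real and imaginary parts are
    independent N(0, C_l / 2), so that E|a_lm|^2 = C_l; the negative-m
    coefficients follow from a_{l,-m} = (-1)^m * conj(a_lm).
    """
    a = {0: complex(rng.gauss(0.0, math.sqrt(c_l)), 0.0)}
    s = math.sqrt(c_l / 2.0)
    for m in range(1, l + 1):
        a[m] = complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        a[-m] = (-1) ** m * a[m].conjugate()
    return a

def c_hat(a):
    """Estimator (3): average of |a_lm|^2 over the 2l + 1 values of m."""
    return sum(abs(v) ** 2 for v in a.values()) / len(a)

# Hypothetical values, chosen only for illustration.
rng = random.Random(0)
l, c_l = 50, 2.0
estimates = [c_hat(sample_alm(l, c_l, rng)) for _ in range(2000)]
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate)  # should be close to c_l, in line with unbiasedness
```

Averaging over many simulated multipoles is only a sanity check of unbiasedness; for a single sky, the spread of $\hat C_l$ around $C_l$ shrinks as $2l+1$ grows.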

Generally speaking, the angular bispectrum can be viewed as the harmonic transform of the three-point angular correlation function, much as the angular power spectrum is the Legendre transform of the (two-point) angular correlation function. More precisely, write $\Omega_i = (\theta_i,\varphi_i)$, for $i = 1,2,3$; we have

$$ET(\Omega_1)T(\Omega_2)T(\Omega_3) = \sum_{l_1,l_2,l_3=1}^{\infty}\ \sum_{m_1,m_2,m_3} B^{m_1m_2m_3}_{l_1l_2l_3}\,Y_{l_1m_1}(\Omega_1)Y_{l_2m_2}(\Omega_2)Y_{l_3m_3}(\Omega_3), \tag{4}$$

where the bispectrum $B^{m_1m_2m_3}_{l_1l_2l_3}$ is given by

$$B^{m_1m_2m_3}_{l_1l_2l_3} = E(a_{l_1m_1}a_{l_2m_2}a_{l_3m_3}). \tag{5}$$


Here, and in the sequel, the sums over $m_i$ run from $-l_i$ to $l_i$, unless otherwise indicated. Both (4) and (5) are clearly equal to zero for zero-mean Gaussian fields. Moreover, the assumption that the CMB random field is statistically isotropic entails that the right- and left-hand sides of (4) should be left unaltered by a rotation of the coordinate system. Therefore $B^{m_1m_2m_3}_{l_1l_2l_3}$ must take values ensuring that the three-point correlation function on the left-hand side of (4) remains unchanged if the three directions $\Omega_1$, $\Omega_2$ and $\Omega_3$ are rotated by the same angle. Careful choices of the orientations entail that the angular bispectrum of an isotropic field can be nonzero only if:

(a) $l_1$, $l_2$ and $l_3$ satisfy the triangle rule, $l_i \le l_j + l_k$ for all choices of $i, j, k = 1, 2, 3$;

(b) $l_1 + l_2 + l_3$ is even; and

(c) $m_1 + m_2 + m_3 = 0$.
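Conditions (a) to (c) are easy to turn into a filter over multipole configurations. The sketch below is only illustrative: the helper names `is_admissible` and `count_configurations`, the extra check $|m_i| \le l_i$, and the restriction to ordered triples $l_1 \le l_2 \le l_3$ are assumptions made here for convenience.

```python
def is_admissible(l1, l2, l3, m1=0, m2=0, m3=0):
    """Conditions (a)-(c), plus |m_i| <= l_i, for a possibly nonzero bispectrum."""
    ls, ms = (l1, l2, l3), (m1, m2, m3)
    if any(abs(m) > l for l, m in zip(ls, ms)):
        return False
    triangle = l1 <= l2 + l3 and l2 <= l1 + l3 and l3 <= l1 + l2   # (a)
    parity = (l1 + l2 + l3) % 2 == 0                               # (b)
    return triangle and parity and m1 + m2 + m3 == 0               # (c)

def count_configurations(L, l_min=1):
    """Number of triples l_min <= l1 <= l2 <= l3 <= L that pass (a) and (b)."""
    return sum(
        is_admissible(l1, l2, l3)
        for l1 in range(l_min, L + 1)
        for l2 in range(l1, L + 1)
        for l3 in range(l2, L + 1)
    )

print(count_configurations(3))
```

The count grows cubically in `L`, which gives a feel for how many bispectrum ordinates a high-resolution experiment would have to handle.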
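More generally, Hu [17] shows that a necessary and sufficient condition for $B^{m_1m_2m_3}_{l_1l_2l_3}$ to represent the angular bispectrum of an isotropic random field is that there exist a real symmetric function of $l_1, l_2, l_3$, which we denote by $b_{l_1l_2l_3}$, such that we have the identity

$$B^{m_1m_2m_3}_{l_1l_2l_3} = G^{m_1m_2m_3}_{l_1l_2l_3}\,b_{l_1l_2l_3}; \tag{6}$$

$b_{l_1l_2l_3}$ is labeled the reduced bispectrum. In (6) we are using the Gaunt integral $G^{m_1m_2m_3}_{l_1l_2l_3}$, defined by

$$G^{m_1m_2m_3}_{l_1l_2l_3} = \int_0^{\pi}\int_0^{2\pi} Y_{l_1m_1}(\theta,\varphi)Y_{l_2m_2}(\theta,\varphi)Y_{l_3m_3}(\theta,\varphi)\sin\theta\,d\varphi\,d\theta$$

$$= \left(\frac{(2l_1+1)(2l_2+1)(2l_3+1)}{4\pi}\right)^{1/2}\begin{pmatrix} l_1 & l_2 & l_3\\ 0 & 0 & 0\end{pmatrix}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix},$$

where the so-called "Wigner 3j symbols" appearing on the second line are defined in the Appendix. It can be shown that the Gaunt integral is identically equal to zero unless the conditions (a)-(c) are fulfilled. Often the dependence on $m_1, m_2, m_3$, which does not carry any physical information if the field is isotropic, is eliminated by focusing on the angle-averaged bispectrum, defined by

$$B_{l_1l_2l_3} = \sum_{m_1=-l_1}^{l_1}\sum_{m_2=-l_2}^{l_2}\sum_{m_3=-l_3}^{l_3}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix}B^{m_1m_2m_3}_{l_1l_2l_3} \tag{7}$$

$$= \left(\frac{(2l_1+1)(2l_2+1)(2l_3+1)}{4\pi}\right)^{1/2}\begin{pmatrix} l_1 & l_2 & l_3\\ 0 & 0 & 0\end{pmatrix}b_{l_1l_2l_3},$$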


where we have used (48) (see the Appendix). In practice, of course, the bispectrum is not observable; its minimum mean-square error estimator is provided by Hu [17],

$$\hat B_{l_1l_2l_3} = \sum_{m_1=-l_1}^{l_1}\sum_{m_2=-l_2}^{l_2}\sum_{m_3=-l_3}^{l_3}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix}(a_{l_1m_1}a_{l_2m_2}a_{l_3m_3}).$$

The statistic $\hat B_{l_1l_2l_3}$ is called the (sample) angle-averaged bispectrum; for any realization of the random field $T$, it is a real-valued scalar, which does not depend on the choice of the coordinate axes, and it is invariant with respect to permutation of its arguments $l_1, l_2, l_3$.

Now note that, under Gaussianity, the distribution of $a_{lm}/\sqrt{C_l}$ does not depend on any nuisance parameter. The bispectrum can hence be easily made model-independent; namely, we can focus on the normalized bispectrum, which we define by

$$I_{l_1l_2l_3} = (-1)^{(l_1+l_2+l_3)/2}\frac{\hat B_{l_1l_2l_3}}{\sqrt{C_{l_1}C_{l_2}C_{l_3}}}. \tag{8}$$

The factor $(-1)^{(l_1+l_2+l_3)/2}$ is usually not included in the definition of the normalized bispectrum; it corresponds, however, to the sign of Wigner's coefficients for $m_1 = m_2 = m_3 = 0$, and thus it seems natural to include it to ensure that $I_{l_1l_2l_3}$ and $b_{l_1l_2l_3}$ share the same parity [see (7)].

In practice, of course, $I_{l_1l_2l_3}$ is infeasible, and it is thus replaced by the statistic

$$\hat I_{l_1l_2l_3} = (-1)^{(l_1+l_2+l_3)/2}\frac{\hat B_{l_1l_2l_3}}{\sqrt{\hat C_{l_1}\hat C_{l_2}\hat C_{l_3}}};$$

see [12].


3. Higher-order moments of the angular bispectrum. In this section we shall investigate the behavior of the higher-order moments for the normalized bispectrum (8), under the assumption of Gaussianity. We assume without loss of generality $l_1 \le l_2 \le l_3$, and define

$$\Delta_{l_1l_2l_3} = 1 + \delta_{l_1}^{l_2} + \delta_{l_2}^{l_3} + 3\delta_{l_1}^{l_3} = \begin{cases} 1, & \text{for } l_1 < l_2 < l_3,\\ 2, & \text{for } l_1 = l_2 < l_3 \text{ or } l_1 < l_2 = l_3,\\ 6, & \text{for } l_1 = l_2 = l_3;\end{cases}$$

here and in the sequel, $\delta_a^b$ denotes Kronecker's delta, that is, $\delta_a^b = 1$ for $a = b$, zero otherwise.

Under Gaussianity, it is obvious that the expectation of all odd powers of $I_{l_1l_2l_3}$ is zero. To analyze the behavior of even powers, we first recall that, for a multivariate zero-mean Gaussian vector $(x_1,\ldots,x_{2k})$, we have the following diagram formula:

$$E(x_1\times x_2\times\cdots\times x_{2k}) = \sum (Ex_{i_1}x_{i_2})\times\cdots\times(Ex_{i_{2k-1}}x_{i_{2k}}), \tag{9}$$

where the sum is over all the $(2k)!/(k!\,2^k)$ different ways of grouping $(x_1,\ldots,x_{2k})$ into pairs (e.g., [1], page 108). Even powers of $I_{l_1l_2l_3}$ yield even powers of the $a_{lm}$'s, which have a complex Gaussian distribution, weighted by Wigner's 3j coefficients; we shall then need to use some arguments from graph theory, which is widely used in physics when handling Wigner's 3j coefficients (see [34], Chapter 11).
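The diagram formula (9) can be checked by brute force for small $k$. The following sketch enumerates all pairings recursively and sums the corresponding products of covariances; the function names `pairings` and `gaussian_moment` and the $2\times 2$ covariance matrix are illustrative assumptions.

```python
def pairings(items):
    """Yield every way of grouping `items` (even length) into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for k, partner in enumerate(rest):
        # Pair the first element with each remaining one, then recurse.
        for tail in pairings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail

def gaussian_moment(indices, cov):
    """E(x_{i1} ... x_{i2k}) for a zero-mean Gaussian vector, via formula (9)."""
    total = 0.0
    for pairing in pairings(list(indices)):
        term = 1.0
        for a, b in pairing:
            term *= cov[a][b]
        total += term
    return total

# Hypothetical covariance matrix of (x_0, x_1), for illustration only.
cov = [[1.0, 0.5], [0.5, 2.0]]
n_pairings = sum(1 for _ in pairings(list(range(6))))
print(n_pairings)                          # (2k)!/(k! 2^k) with k = 3
print(gaussian_moment([0, 0, 1, 1], cov))  # E(x_0^2 x_1^2)
```

For an odd number of factors the generator yields no pairing at all, so the function returns zero, in agreement with the vanishing of odd moments.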

Consider the Cartesian product $I \times J$, where $I$, $J$ are sets of positive integers of cardinality $\#(I) = P$, $\#(J) = Q$; it is convenient to visualize these elements in a $P \times Q$ matrix with $P$ rows labeled by $i$ and $Q$ columns labeled by $j$. A diagram $\gamma$ is any partition of the $P \times Q$ elements into pairs like $\{(i_1,j_1),(i_2,j_2)\}$; these pairs are called the edges of the diagram. For our purposes, it is enough to consider diagrams with an even number of rows $P$; we label $\Gamma(I,J)$ the family of these diagrams. It can be checked that, for given $I$, $J$, there exist $(P\times Q-1)!!$ different diagrams, each of them composed of $(P\times Q)/2$ pairs; we recall that $(2p-1)!! := (2p-1)\times(2p-3)\times\cdots\times 1$ for $p = 1,2,\ldots$. We also note that if we identify each row $i_k$ with a vertex (or node), and view these vertices as linked together by the edges $\{(i_k,j_k),(i_{k'},j_{k'})\} = i_ki_{k'}$, then it is possible to associate to each diagram a graph. As is well known, a graph is an ordered pair $(I,E)$, where $I$ is nonempty (in our case the set of the rows of the diagram), and $E$ is a set of unordered pairs of vertices (in our case, the pairs of rows that are linked in a diagram). We consider only graphs which are not directed, that is, $(i_1i_2)$ and $(i_2i_1)$ identify the same edge; however, we do allow for repetitions of edges (two rows may be linked twice), in which case the term multigraph is more appropriate. In general, a graph carries less information than a diagram (the information on the "columns," i.e., the second element $j_k$, is neglected), but it is much easier to represent pictorially. We shall use some results on graphs below; with a slight abuse of notation, we denote the graph $\gamma$ with the same letter as the corresponding diagram.
We say that:

(a) A diagram has a flat edge if there is at least one pair $\{(i_1,j_1),(i_2,j_2)\}$ such that $i_1 = i_2$; we write $\gamma \in \Gamma_F(I,J)$ for a diagram with at least a flat edge, and $\gamma \in \Gamma_{\bar F}(I,J)$ otherwise. A graph corresponding to a diagram with a flat edge includes an edge $i_ki_k$, which arrives in the same vertex where it started; for these circumstances the term pseudograph is preferred by some authors (e.g., [11]).

(b) A diagram $\gamma \in \Gamma_{\bar F}(I,J)$ is connected if it is not possible to partition the $i$'s into two sets $A$, $B$ such that there are no edges with $i_1 \in A$ and $i_2 \in B$. We write $\gamma \in \Gamma_C(I,J)$ for connected diagrams, $\gamma \in \Gamma_{\bar C}(I,J)$ otherwise. Obviously a diagram is connected if and only if the corresponding graph is connected, in the standard sense.

(c) A diagram $\gamma \in \Gamma_{\bar F}(I,J)$ is paired if, considering any two sets of edges $\{(i_1,j_1),(i_2,j_2)\}$ and $\{(i_3,j_3),(i_4,j_4)\}$, then $i_1 = i_3$ implies $i_2 = i_4$; in words, the rows are completely coupled two by two. We write $\gamma \in \Gamma_P(I,J)$ for paired diagrams.

It is obvious that for $P > 2$ a paired diagram cannot be connected. Note that if $Q$ is odd, paired diagrams cannot have flat edges, so that the assumption $\gamma \in \Gamma_{\bar F}(I,J)$ becomes redundant.

(d) We shall say a diagram has a $k$-loop if there exists a sequence of $k$ edges

$$\{(i_1,j_1),(i_2,j_2)\},\ldots,\{(i_k,j_k),(i_{k+1},j_{k+1})\} = (i_1i_2),\ldots,(i_ki_{k+1})$$

such that $i_1 = i_{k+1}$; we write $\gamma \in \Gamma_{L(k)}(I,J)$ for diagrams with a $k$-loop and no loop of order smaller than $k$.

Note that $\Gamma_F(I,J) = \Gamma_{L(1)}(I,J)$ (a flat edge is a 1-loop); also, we write

$$\Gamma_{CL(k)}(I,J) = \Gamma_C(I,J)\cap\Gamma_{L(k)}(I,J)$$

for connected diagrams with $k$-loops, and $\Gamma_{C\bar L(k)}(I,J)$ for connected diagrams with no loops of order $k$ or smaller. For instance, a connected diagram belongs to $\Gamma_{C\bar L(2)}(I,J)$ if there are neither flat edges nor two edges $\{(i_1,j_1),(i_2,j_2)\}$ and $\{(i_3,j_3),(i_4,j_4)\}$ such that $i_1 = i_3$ and $i_2 = i_4$; in words, there are no pairs of rows which are connected twice.
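For small $P$ and $Q$ the families of diagrams defined above can be enumerated directly. The sketch below represents a diagram as a list of edges between cells $(i,j)$ and counts diagrams with flat edges, connected diagrams and paired diagrams; the function names and the dictionary returned by `census` are illustrative assumptions, and the enumeration is only practical for very small grids.

```python
def pairings(cells):
    """Yield every partition of `cells` (even length) into unordered pairs."""
    if not cells:
        yield []
        return
    first, rest = cells[0], cells[1:]
    for k, partner in enumerate(rest):
        for tail in pairings(rest[:k] + rest[k + 1:]):
            yield [(first, partner)] + tail

def diagrams(P, Q):
    """All diagrams on a P x Q grid of cells (i, j)."""
    return pairings([(i, j) for i in range(P) for j in range(Q)])

def has_flat_edge(diagram):
    """True if some edge joins two cells of the same row."""
    return any(a[0] == b[0] for a, b in diagram)

def is_connected(diagram, P):
    """True if the multigraph on the rows induced by the edges is connected."""
    adj = {i: set() for i in range(P)}
    for a, b in diagram:
        adj[a[0]].add(b[0])
        adj[b[0]].add(a[0])
    seen, stack = {0}, [0]
    while stack:
        for nxt in adj[stack.pop()] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return len(seen) == P

def is_paired(diagram):
    """True if there are no flat edges and every row is linked to a single other row."""
    partner = {}
    for a, b in diagram:
        if a[0] == b[0]:
            return False
        for r, s in ((a[0], b[0]), (b[0], a[0])):
            if partner.setdefault(r, s) != s:
                return False
    return True

def census(P, Q):
    """Count all diagrams, those with a flat edge, and (among the rest) connected and paired ones."""
    counts = {"all": 0, "flat": 0, "connected": 0, "paired": 0}
    for d in diagrams(P, Q):
        counts["all"] += 1
        if has_flat_edge(d):
            counts["flat"] += 1
            continue
        counts["connected"] += is_connected(d, P)
        counts["paired"] += is_paired(d)
    return counts

print(census(2, 3))
```

For $P = 2$ every diagram without flat edges is both connected and paired; for $P = 4$, $Q = 3$ the two classes are disjoint, as noted in the text.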

A graph is Hamiltonian [written $\gamma \in \Gamma_H(I,J)$] [11] if it has a spanning cycle, that is, if there exists a loop which covers all the vertices without touching any of them (other than the first) twice. Two graphs $G_1 = (I_1,E_1)$ and $G_2 = (I_2,E_2)$ are isomorphic if there exists a one-to-one, onto mapping $\phi: I_1 \to I_2$ such that $(i_1i_2) \in E_1 \Leftrightarrow (\phi(i_1)\phi(i_2)) \in E_2$.

In many cases either $I$ or $J$ (or both) can be simply taken as the set of the first $p$ or $q$ natural numbers, that is, $I = \{1,\ldots,p\}$, $J = \{1,\ldots,q\}$. Under such circumstances, when no confusion is possible we shall occasionally write $\Gamma(I,q)$, $\Gamma(p,J)$ or $\Gamma(p,q)$ for $\Gamma(I,J)$. Some examples of graphs are drawn in Figures 1-6.

We have the following result. 


THEOREM 3.1. For all $l_1 \le l_2 \le l_3$ we have

$$EI^2_{l_1l_2l_3} = \Delta_{l_1l_2l_3}; \tag{10}$$

moreover, for $p = 2, 3, 4$,

$$EI^{2p}_{l_1l_2l_3} = (2p-1)!!\,\Delta^p_{l_1l_2l_3} + O(l_1^{-1}). \tag{11}$$

FIG. 1. A multigraph for $\gamma \in \Gamma_P(6,3)$.


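PROOF. Result (10) is known in the physics literature; see, for instance, [19]. For (11), we recall that

$$Ea_{l_1m_1}a_{l_2m_2} = (-1)^{m_1}C_{l_1}\delta_{l_1}^{l_2}\delta_{m_1}^{-m_2}; \tag{12}$$

hence, in view of (9), and because the spherical harmonic coefficients are (complex) Gaussian distributed, the following formula holds, for all $I$:

$$E\left\{\prod_{i\in I}\prod_{j=1}^{3}\frac{a_{l_jm_{ij}}}{\sqrt{C_{l_j}}}\right\} = \sum_{\gamma\in\Gamma(I,3)}\delta(\gamma;l_1,l_2,l_3), \tag{13}$$

where we define

$$\delta(\gamma;l_1,l_2,l_3) = \prod_{\{(i_u,j_u),(i_u',j_u')\}\in\gamma}(-1)^{m_{i_uj_u}}\,\delta_{m_{i_uj_u}}^{-m_{i_u'j_u'}}\,\delta_{l_{j_u}}^{l_{j_u'}}. \tag{14}$$

For any diagram $\gamma$, we can also write

$$D[\gamma;l_1,l_2,l_3] = \left[\prod_{i\in I}\prod_{j=1}^{3}\sum_{m_{ij}=-l_j}^{l_j}\right]\left\{\prod_{i\in I}\begin{pmatrix} l_1 & l_2 & l_3\\ m_{i1} & m_{i2} & m_{i3}\end{pmatrix}\right\}\delta(\gamma;l_1,l_2,l_3), \tag{15}$$

where

$$\left[\prod_{i\in I}\prod_{j=1}^{3}\sum_{m_{ij}=-l_j}^{l_j}\right] := \sum_{m_{i_11}=-l_1}^{l_1}\cdots\sum_{m_{i_P3}=-l_3}^{l_3}, \qquad I = \{i_1,\ldots,i_P\};$$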



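to be more explicit, there are $3\times P$ summations to compute: for instance, when $P = 2$ we get

$$\left[\prod_{i\in\{1,2\}}\prod_{j=1}^{3}\sum_{m_{ij}=-l_j}^{l_j}\right] = \sum_{m_{11}=-l_1}^{l_1}\sum_{m_{12}=-l_2}^{l_2}\sum_{m_{13}=-l_3}^{l_3}\sum_{m_{21}=-l_1}^{l_1}\sum_{m_{22}=-l_2}^{l_2}\sum_{m_{23}=-l_3}^{l_3}.$$

Furthermore, we also define

$$D[\Gamma(I,3);l_1,l_2,l_3] = \sum_{\gamma\in\Gamma(I,3)}D[\gamma;l_1,l_2,l_3];$$

in words, $D[\,\cdot\,;l_1,l_2,l_3]$ represents the component of the expected value that corresponds to a particular set of diagrams. Notice that

$$EI^{2p}_{l_1l_2l_3} = E\left\{\prod_{i=1}^{2p}\ \sum_{m_{i1},m_{i2},m_{i3}}\begin{pmatrix} l_1 & l_2 & l_3\\ m_{i1} & m_{i2} & m_{i3}\end{pmatrix}\prod_{j=1}^{3}\frac{a_{l_jm_{ij}}}{\sqrt{C_{l_j}}}\right\}$$

$$= \left[\prod_{i=1}^{2p}\prod_{j=1}^{3}\sum_{m_{ij}=-l_j}^{l_j}\right]\left\{\prod_{i=1}^{2p}\begin{pmatrix} l_1 & l_2 & l_3\\ m_{i1} & m_{i2} & m_{i3}\end{pmatrix}\right\}E\left\{\prod_{i=1}^{2p}\prod_{j=1}^{3}\frac{a_{l_jm_{ij}}}{\sqrt{C_{l_j}}}\right\}$$

$$= \left[\prod_{i=1}^{2p}\prod_{j=1}^{3}\sum_{m_{ij}=-l_j}^{l_j}\right]\left\{\prod_{i=1}^{2p}\begin{pmatrix} l_1 & l_2 & l_3\\ m_{i1} & m_{i2} & m_{i3}\end{pmatrix}\right\}\sum_{\gamma\in\Gamma(2p,3)}\delta(\gamma;l_1,l_2,l_3)$$

$$= D[\Gamma(2p,3);l_1,l_2,l_3].$$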


Now

$$D[\Gamma(2p,3);l_1,l_2,l_3] = D[\Gamma_P(2p,3);l_1,l_2,l_3] + D[\Gamma(2p,3)\backslash\Gamma_P(2p,3);l_1,l_2,l_3];$$

our aim is to show that

$$D[\Gamma_P(2p,3);l_1,l_2,l_3] = (2p-1)!!\,\Delta^p_{l_1l_2l_3}, \tag{16}$$

$$D[\Gamma(2p,3)\backslash\Gamma_P(2p,3);l_1,l_2,l_3] = O(l_1^{-1}). \tag{17}$$

To establish (16) and (17) we rely on Propositions 3.1 and 3.2, whose proofs are collected in the Appendix. $\Box$


The next proposition relates to the "Gaussian" component of the expected value, that is, the diagrams that are paired.

PROPOSITION 3.1. For any $p \in \mathbb{N}$, and $I$ with cardinality $\#(I) = 2p$, we have

$$D[\Gamma_P(I,3);l_1,l_2,l_3] = (2p-1)!!\,\Delta^p_{l_1l_2l_3}.$$

The proof that (17) $= O(l_1^{-1})$ requires considerably more work; the next three lemmas refer to diagrams with loops of order 1, 2 and 3, respectively.

LEMMA 3.1. For diagrams with a flat edge, $\gamma \in \Gamma_F(I,3)$, we have

$$D[\gamma;l_1,l_2,l_3] = 0.$$


The next two results show how diagrams belonging to $\Gamma_{CL(2)}(I,3)$, $\Gamma_{CL(3)}(I,3)$ can be "reduced"; namely, they show how the corresponding summands in the expected value can be expressed in terms of smaller-order diagrams. Without loss of generality we can take $\#(I) \ge 4$; indeed the case $\#(I) = 2$ has been dealt with in the proof of Theorem 3.1, while we know that odd moments are identically equal to zero. Let $\gamma \in \Gamma_{CL(2)}(I,3)$ be a connected diagram with a 2-loop, and denote by $i_1, i_2$ the rows that are linked by two edges; in other words, the diagram includes both the edges $[(i_1,j_1),(i_2,j_1')]$ and $[(i_1,j_2),(i_2,j_2')]$; in the sequel, $j_k, j_k'$ take values in $\{1,2,3\}$, for any integer $k$. Because the diagram is connected, there must exist also edges $[(i_1,j_3),(i_3,j_3')]$ and $[(i_2,j_4),(i_4,j_4')]$, where $i_3, i_4 \ne i_1, i_2$. We denote by $\gamma_{R(i_1,i_2)}$ the lower-order diagram which is obtained by deleting $[(i_1,j_1),(i_2,j_1')]$ and $[(i_1,j_2),(i_2,j_2')]$, and substituting $[(i_1,j_3),(i_3,j_3')]$ and $[(i_2,j_4),(i_4,j_4')]$ with $[(i_3,j_3'),(i_4,j_4')]$. In graphical terms, $\gamma_{R(i_1,i_2)}$ is obtained by cutting the two nodes $i_1, i_2$ and merging together the edges that departed from them to reach other vertices; $\gamma_{R(i_1,i_2)}$ can itself belong to $\Gamma_{CL(2)}(\#(I)-2,3)$ and the argument can be iterated (Figure 2). We note that these reductions need not be unique in general; any arbitrary choice of a suitable pair of nodes would not affect our argument, however.


LEMMA 3.2. For $\gamma \in \Gamma_{CL(2)}(I,3)$ and $\gamma_{R(i_1,i_2)}$ as defined before, we have

$$D[\gamma;l_1,l_2,l_3] = \frac{1}{2l_{j_3}+1}\,D[\gamma_{R(i_1,i_2)};l_1,l_2,l_3].$$

By definitions (14) and (15) $D[\gamma;l_1,l_2,l_3]$ can be nonzero only if $l_{j_3} = l_{j_3'} = l_{j_4} = l_{j_4'}$, so there is no notational ambiguity in Lemma 3.2. More explicitly, assuming, for instance, that the edges $[(1,1),(2,1)]$ and $[(1,2),(2,2)]$ are present in $\gamma$, then a factor $(2l_3+1)^{-1}$ will emerge from the reduction.
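The factor $(2l_3+1)^{-1}$ is the one that appears in the standard orthogonality relation of the 3j coefficients, $\sum_{m_1,m_2}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix}^2 = (2l_3+1)^{-1}$, and this can be checked numerically. The sketch below is a plain implementation of the Racah sum for integer arguments using only the standard library; the names `wigner_3j` and `two_loop_factor` are illustrative assumptions and the code is not taken from the paper.

```python
from math import factorial, sqrt

def wigner_3j(j1, j2, j3, m1, m2, m3):
    """Wigner 3j symbol for integer arguments, via the Racah sum."""
    if m1 + m2 + m3 != 0 or j3 < abs(j1 - j2) or j3 > j1 + j2:
        return 0.0
    if abs(m1) > j1 or abs(m2) > j2 or abs(m3) > j3:
        return 0.0
    f = factorial
    # Triangle coefficient and the (j + m)!(j - m)! prefactor.
    tri = f(j1 + j2 - j3) * f(j1 - j2 + j3) * f(-j1 + j2 + j3) / f(j1 + j2 + j3 + 1)
    pref = f(j1 + m1) * f(j1 - m1) * f(j2 + m2) * f(j2 - m2) * f(j3 + m3) * f(j3 - m3)
    # Range of k for which every factorial argument is nonnegative.
    k_min = max(0, j2 - j3 - m1, j1 - j3 + m2)
    k_max = min(j1 + j2 - j3, j1 - m1, j2 + m2)
    total = 0.0
    for k in range(k_min, k_max + 1):
        den = (f(k) * f(j1 + j2 - j3 - k) * f(j1 - m1 - k) * f(j2 + m2 - k)
               * f(j3 - j2 + m1 + k) * f(j3 - j1 - m2 + k))
        total += (-1) ** k / den
    return (-1) ** (j1 - j2 - m3) * sqrt(tri * pref) * total

def two_loop_factor(l1, l2, l3, m3):
    """Sum over m1, m2 of the squared 3j symbol, for fixed l3 and m3."""
    return sum(wigner_3j(l1, l2, l3, m1, m2, m3) ** 2
               for m1 in range(-l1, l1 + 1)
               for m2 in range(-l2, l2 + 1))

print(two_loop_factor(2, 3, 4, 1), 1.0 / (2 * 4 + 1))
```

The plain factorial formula is adequate for small multipoles; for large $l$ a recursive or logarithmic evaluation would be numerically safer.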

We now focus on diagrams with a 3-loop. Let $\gamma \in \Gamma_{CL(3)}(I,3)$ be a connected diagram with a 3-loop, and denote by $i_1, i_2, i_3$ the rows that are linked by the loop; in other words, the diagram includes the three edges

$$[(i_1,j_1),(i_2,j_2)],\quad [(i_2,j_3),(i_3,j_4)],\quad [(i_3,j_5),(i_1,j_6)].$$

FIG. 2. $\gamma \in \Gamma_{CL(2)}(6,3)$ and $\gamma_{R(i_1,i_2)}$.

Because the diagram is connected, there must exist also edges

$$[(i_1,j_7),(i_4,j_8)],\quad [(i_2,j_9),(i_5,j_{10})],\quad [(i_3,j_{11}),(i_6,j_{12})],$$

where $i_4, i_5, i_6 \ne i_1, i_2, i_3$. We denote by $\gamma_{R(i_1,i_2,i_3)}$ the lower-order diagram which is obtained by replacing $i_2, i_3$ with $i_1$ and then deleting all flat edges. More explicitly, $\gamma_{R(i_1,i_2,i_3)}$ is obtained by deleting

$$[(i_1,j_1),(i_2,j_2)],\quad [(i_2,j_3),(i_3,j_4)],\quad [(i_3,j_5),(i_1,j_6)],$$

and substituting

$$[(i_2,j_9),(i_5,j_{10})],\quad [(i_3,j_{11}),(i_6,j_{12})]$$

with

$$[(i_1,j_9),(i_5,j_{10})],\quad [(i_1,j_{11}),(i_6,j_{12})].$$

In graphical terms, we are merging three nodes into a single one (see Figure 3); again, $\gamma_{R(i_1,i_2,i_3)}$ can belong to $\Gamma_{CL(3)}(\#(I)-2,3)$ and the argument can be iterated.

LEMMA 3.3. For $\gamma \in \Gamma_{CL(3)}(I,3)$ a connected diagram with a 3-loop and $\gamma_{R(i_1,i_2,i_3)}$ as defined before, we have

$$D[\gamma;l_1,l_2,l_3] = \begin{Bmatrix} l_1 & l_2 & l_3\\ l_1 & l_2 & l_3\end{Bmatrix}D[\gamma_{R(i_1,i_2,i_3)};l_1,l_2,l_3],$$

where on the right-hand side we have used Wigner's 6j coefficient, defined in the Appendix; hence

$$D[\gamma;l_1,l_2,l_3] = O(l_3^{-1})\,D[\gamma_{R(i_1,i_2,i_3)};l_1,l_2,l_3].$$


FIG. 3. $\gamma \in \Gamma_{CL(3)}(6,3)$ and $\gamma_{R(i_1,i_2,i_3)}$.


The proofs of Lemmas 3.1, 3.2 and 3.3 can be obtained as an application of the 
graphical method described, for instance, in ([34], Chapters 11 and 12), and they 
are hence omitted for brevity's sake. 

The next proposition exploits the previous results to provide the bound (17) on the "non-Gaussian" part of the higher-order moments of the angular bispectrum.

PROPOSITION 3.2. For all $I$ such that $\#(I) = 4$, 6 or 8 and $l_1 \le l_2 \le l_3$, we have

$$D[\Gamma(I,3)\backslash\Gamma_P(I,3);l_1,l_2,l_3] = O(l_1^{-1}).$$


REMARK 3.1. A careful inspection of the proofs of Propositions 3.1 and 3.2 reveals that for $l_1 < l_2 < l_3$ we obtain the special case

$$EI^4_{l_1l_2l_3} = 3 + \frac{6}{2l_1+1} + \frac{6}{2l_2+1} + \frac{6}{2l_3+1} + 6\begin{Bmatrix} l_1 & l_2 & l_3\\ l_1 & l_2 & l_3\end{Bmatrix}, \tag{18}$$

a result that we shall exploit for statistical applications.


REMARK 3.2. It is natural to conjecture that a result analogous to Proposition 3.2 will hold for all sets $\#(I) = 2p$, $p \in \mathbb{N}$. This conjecture is not simple to explore, however, because the proofs in the Appendix require some analytic properties of the summations of (products of) Wigner's 3j coefficients, which have not been extended (to the best of our knowledge) to products of arbitrary order.




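4. Unknown angular power spectrum. In this section we focus on the more realistic case where the angular power spectrum is unknown and estimated from the data; so we consider $\hat I_{l_1l_2l_3}$ rather than $I_{l_1l_2l_3}$. As before, under Gaussianity of the underlying field $T(\theta,\varphi)$

$$E\hat I^{2p-1}_{l_1l_2l_3} = 0, \qquad p = 1,2,\ldots,$$

by a simple symmetry argument. Now note that

$$\left\{\frac{|a_{l0}|^2}{\hat C_l},\frac{2|a_{l1}|^2}{\hat C_l},\ldots,\frac{2|a_{ll}|^2}{\hat C_l}\right\} = (2l+1)\left\{\frac{|a_{l0}|^2}{|a_{l0}|^2+\sum_{m=1}^{l}2|a_{lm}|^2},\ldots,\frac{2|a_{ll}|^2}{|a_{l0}|^2+\sum_{m=1}^{l}2|a_{lm}|^2}\right\}$$

$$\stackrel{d}{=} (2l+1)\{\xi_{l0},\ldots,\xi_{ll}\} \stackrel{d}{=} (2l+1)\,\mathrm{Dir}\left(\tfrac{1}{2},1,\ldots,1\right);$$

here $\stackrel{d}{=}$ denotes equality in distribution and $\mathrm{Dir}(\theta_0,\ldots,\theta_l)$ a Dirichlet distribution with parameters $(\theta_0,\ldots,\theta_l)$. Define

$$u_{lm} = \frac{a_{lm}}{\sqrt{C_l}}, \qquad \hat u_{lm} = \frac{a_{lm}}{\sqrt{\hat C_l}}, \qquad m = 0,1,\ldots,l. \tag{19}$$

We have the following simple result.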


THEOREM 4.1. Let $l$ and $p$ be positive integers, and define

$$g(l;p) = \prod_{k=1}^{p}\frac{2l+1}{2l+2k-1}.$$

Now for $u$ and $\hat u$ defined by (19), we have

$$E\left\{\hat u_{l0}^{q_0}\prod_{i=1}^{k}\hat u_{li}^{q_i}(\hat u^*_{li})^{q_i'}\right\} = E\left\{u_{l0}^{q_0}\prod_{i=1}^{k}u_{li}^{q_i}(u^*_{li})^{q_i'}\right\}\times g\left(l;\frac{q_0+q_1+q_1'+\cdots+q_k+q_k'}{2}\right).$$




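PROOF. By symmetry arguments, it is easy to see that both sides are zero unless $q_0 = 2p_0$ (say) is even and $q_i = q_i' = p_i$ (say), for $i = 1,\ldots,k$. The $u_{lm}$ are independent over different $m$'s, and thus we have

$$E\left\{u_{l0}^{2p_0}\prod_{i=1}^{k}u_{li}^{p_i}(u^*_{li})^{p_i}\right\} = Eu_{l0}^{2p_0}\prod_{i=1}^{k}E|u_{li}|^{2p_i} = \begin{cases}(2p_0-1)!!\prod_{i=1}^{k}p_i!, & \text{for } p_0 > 0,\\[4pt] \prod_{i=1}^{k}p_i!, & \text{for } p_0 = 0,\end{cases}$$

because

$$u_{l0} \stackrel{d}{=} N(0,1) \quad\text{and}\quad u_{lm}u^*_{lm} = |u_{lm}|^2 \stackrel{d}{=} \exp(1).$$

Now write $p = p_0 + \cdots + p_k$, and note that ([18], page 233)

$$E\left\{\hat u_{l0}^{2p_0}|\hat u_{l1}|^{2p_1}\cdots|\hat u_{lk}|^{2p_k}\right\} = \frac{(2l+1)^p}{2^{p_1+\cdots+p_k}}E\left\{\xi_{l0}^{p_0}\xi_{l1}^{p_1}\cdots\xi_{lk}^{p_k}\right\}$$

$$= \frac{(2l+1)^p}{2^{p_1+\cdots+p_k}}\,\frac{\Gamma(l+1/2)}{\Gamma(l+p+1/2)}\,\frac{\Gamma(p_0+1/2)\Gamma(p_1+1)\cdots\Gamma(p_k+1)}{\Gamma(1/2)}$$

$$= (2l+1)^p\,\frac{(2p_0-1)!!\,p_1!\times\cdots\times p_k!}{(2l+1)\times\cdots\times(2l+2p-1)}$$

$$= \begin{cases}(2p_0-1)!!\prod_{i=1}^{k}p_i!\;g(l;p), & \text{for } p_0 > 0,\\[4pt] \prod_{i=1}^{k}p_i!\;g(l;p), & \text{for } p_0 = 0,\end{cases}$$

as claimed. $\Box$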


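Some special cases are as follows:

$$E\left\{\frac{a_{l0}^2}{\hat C_l}\right\} = E\left\{\frac{|a_{lm}|^2}{\hat C_l}\right\} = 1,$$

$$E\left\{\left(\frac{a_{l0}^2}{\hat C_l}\right)^p\right\} = (2l+1)^p\,\frac{(2p-1)!!}{(2l+1)\times\cdots\times(2l+2p-1)},$$

$$E\left\{\left(\frac{|a_{lm}|^2}{\hat C_l}\right)^p\right\} = \frac{(2l+1)^p}{2^p}\,\frac{\Gamma(l+1/2)}{\Gamma(l+p+1/2)}\,\frac{\Gamma(1/2)\Gamma(p+1)}{\Gamma(1/2)} = (2l+1)^p\,\frac{p!}{(2l+1)\times\cdots\times(2l+2p-1)},$$

and for $p = p_1 + p_2$, $p_1, p_2 > 0$,

$$E\left\{\left(\frac{a_{l0}^2}{\hat C_l}\right)^{p_1}\left(\frac{|a_{lm}|^2}{\hat C_l}\right)^{p_2}\right\} = \frac{(2l+1)^p}{2^{p_2}}\,\frac{\Gamma(l+1/2)}{\Gamma(l+p+1/2)}\,\frac{\Gamma(p_1+1/2)\Gamma(p_2+1)}{\Gamma(1/2)} = (2l+1)^p\,\frac{(2p_1-1)!!\,p_2!}{(2l+1)\times\cdots\times(2l+2p-1)},$$

$$E\left\{\left(\frac{|a_{lm_1}|^2}{\hat C_l}\right)^{p_1}\left(\frac{|a_{lm_2}|^2}{\hat C_l}\right)^{p_2}\right\} = \frac{(2l+1)^p}{2^{p}}\,\frac{\Gamma(l+1/2)}{\Gamma(l+p+1/2)}\,\frac{\Gamma(1/2)\Gamma(p_1+1)\Gamma(p_2+1)}{\Gamma(1/2)} = (2l+1)^p\,\frac{p_1!\,p_2!}{(2l+1)\times\cdots\times(2l+2p-1)}.$$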
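By Theorem 4.1, it is possible to establish a simple relationship between the normalized bispectrum with known or unknown angular power spectrum. More precisely, it is immediate to see that for $l_1 < l_2 < l_3$

$$E\hat I^{2p}_{l_1l_2l_3} = EI^{2p}_{l_1l_2l_3}\prod_{i=1}^{3}g(l_i;p),$$

that is, for instance,

$$E\hat I^2_{l_1l_2l_3} = EI^2_{l_1l_2l_3} = 1 \tag{20}$$

and

$$E\hat I^4_{l_1l_2l_3} = EI^4_{l_1l_2l_3}\left(\frac{2l_1+1}{2l_1+3}\right)\left(\frac{2l_2+1}{2l_2+3}\right)\left(\frac{2l_3+1}{2l_3+3}\right). \tag{21}$$

Also, for $l_1 = l_2 < l_3$ and $l_1 < l_2 = l_3$

$$E\hat I^{2p}_{l_1l_1l_3} = EI^{2p}_{l_1l_1l_3}\,g(l_1;2p)\,g(l_3;p), \qquad E\hat I^{2p}_{l_1l_3l_3} = EI^{2p}_{l_1l_3l_3}\,g(l_1;p)\,g(l_3;2p);$$

finally, for $l_1 = l_2 = l_3 = l$

$$E\hat I^{2p}_{lll} = EI^{2p}_{lll}\,g(l;3p),$$

so that, for instance,

$$E\hat I^2_{lll} = 6\left(1-\frac{2}{2l+3}\right)\left(1-\frac{4}{2l+5}\right). \tag{22}$$

It is interesting to note that

$$E\hat I^{2p}_{l_1l_2l_3} \le EI^{2p}_{l_1l_2l_3} \quad\text{and}\quad \lim_{l_1\to\infty}\frac{E\hat I^{2p}_{l_1l_2l_3}}{EI^{2p}_{l_1l_2l_3}} = 1, \tag{23}$$

for all choices of $(l_1,l_2,l_3)$ for which the bispectrum is well defined.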
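The correction factor $g(l;p)$ is a simple finite product, so its effect on the moments is easy to tabulate. The sketch below assumes the form $g(l;p) = \prod_{k=1}^{p}(2l+1)/(2l+2k-1)$, which is the form implied by the special cases (20) to (22), and evaluates it with exact rational arithmetic; the helper names `fourth_moment_ratio` and `second_moment_diagonal` are illustrative assumptions.

```python
from fractions import Fraction

def g(l, p):
    """g(l; p) = prod_{k=1}^{p} (2l + 1) / (2l + 2k - 1), as an exact fraction."""
    out = Fraction(1)
    for k in range(1, p + 1):
        out *= Fraction(2 * l + 1, 2 * l + 2 * k - 1)
    return out

def fourth_moment_ratio(l1, l2, l3):
    """Ratio of the fourth moments with estimated and known C_l, for l1 < l2 < l3, as in (21)."""
    return g(l1, 2) * g(l2, 2) * g(l3, 2)

def second_moment_diagonal(l):
    """Second moment of the normalized bispectrum with estimated C_l at l1 = l2 = l3 = l, as in (22)."""
    return 6 * g(l, 3)

for l in (2, 10, 100, 1000):
    print(l, float(fourth_moment_ratio(l, l + 1, l + 2)), float(second_moment_diagonal(l)))
```

The printed values increase toward 1 and 6 respectively as $l$ grows, which is the behavior described in (23).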


5. Some statistical applications. In this section we exploit the previous results to derive the asymptotic convergence of some functionals of the bispectrum array. Because the expected bispectrum is identically zero for Gaussian fields, these functionals arise as natural candidates in the development of statistical tests for non-Gaussianity.

More precisely, assume that the resolution of the experiment is such that it yields a maximum observable multipole equal to $L$. It is, in practice, infeasible to take into account all available bispectrum ordinates for the implementation of a statistical procedure: indeed, for current experiments these ordinates are of the order of $L^3 \simeq 10^8/10^9$, and the evaluation of all these statistics is beyond the power of the fastest supercomputers for the near future. It is therefore mandatory to consider only a subset of bispectrum ordinates for the test. There are, of course, several possible choices of configurations. We shall restrict our attention to two of them; precisely, for finite integers $l_0 \ge 2$, $K \ge 0$ we shall consider the processes


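$$J_{1L;l_0,K}(r) = \frac{1}{\sqrt{L/2}}\sum_{\substack{l\ \mathrm{even}\\ l=l_0+K}}^{[Lr]}\frac{1}{\sqrt{K+1}}\sum_{u=0}^{K}\frac{\hat I_{l-u,l,l+u}}{\sqrt{\Delta_{l-u,l,l+u}}}, \tag{24}$$

$$J_{2L;l_0,K}(r) = \frac{1}{\sqrt{L/2}}\sum_{\substack{l\ \mathrm{even}\\ l=l_0+K}}^{[Lr]}\frac{1}{\sqrt{K+1}}\sum_{u=0}^{K}\frac{\hat I^2_{l-u,l,l+u}-\Delta_{l-u,l,l+u}}{\sqrt{2}\,\Delta_{l-u,l,l+u}}, \tag{25}$$

and

$$J_{3L;l_0,K}(r) = \frac{1}{\sqrt{L}}\sum_{l=l_0+K+1}^{[Lr]-l_0-K}\frac{1}{\sqrt{K+1}}\sum_{u=0}^{K}\hat I_{l_0+u,l,l+l_0+u}, \tag{26}$$

$$J_{4L;l_0,K}(r) = \frac{1}{\sqrt{L}}\sum_{l=l_0+K+1}^{[Lr]-l_0-K}\frac{1}{\sqrt{K+1}}\sum_{u=0}^{K}\frac{\hat I^2_{l_0+u,l,l+l_0+u}-1}{\sqrt{2}}, \tag{27}$$

where $[\,\cdot\,]$ denotes the integer part of a real number; $0 \le r \le 1$ and $l_0$ is an (arbitrary but fixed) value which can be taken equal to 2, for instance, for cosmological applications (remember the dipole $l_0 = 1$ is usually discarded from CMB data, as it is associated with kinematic effects mainly due to the motion of the Milky Way and the local group of galaxies). As usual, the sums are taken to be equal to zero when the index set is empty. $K$ is a fixed pooling parameter: for $K = 0$ we obtain the special cases

$$J_{1L;l_0}(r) = \frac{1}{\sqrt{L/2}}\sum_{\substack{l\ \mathrm{even}\\ l\ge l_0}}^{[Lr]}\frac{\hat I_{lll}}{\sqrt{6}}, \qquad J_{2L;l_0}(r) = \frac{1}{\sqrt{L/2}}\sum_{\substack{l\ \mathrm{even}\\ l\ge l_0}}^{[Lr]}\frac{\hat I^2_{lll}-6}{6\sqrt{2}}, \tag{28}$$

and

$$J_{3L;l_0}(r) = \frac{1}{\sqrt{L}}\sum_{l=l_0+1}^{[Lr]-l_0}\hat I_{l_0,l,l+l_0}, \qquad J_{4L;l_0}(r) = \frac{1}{\sqrt{L}}\sum_{l=l_0+1}^{[Lr]-l_0}\frac{\hat I^2_{l_0,l,l+l_0}-1}{\sqrt{2}}. \tag{29}$$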

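The normalizing factors are chosen to ensure an asymptotic unit variance for all summands, by means of Theorem 3.1 and (18), (21) and (22). For instance,

$$\mathrm{Var}\{\hat I^2_{l_0,l,l+l_0}\} = \mathrm{Var}\{\hat I^2_{l_0,l,l+l_0}-1\} = E\hat I^4_{l_0,l,l+l_0}-(E\hat I^2_{l_0,l,l+l_0})^2$$

$$= \left\{3+\frac{6}{2l_0+1}+\frac{6}{2l+1}+\frac{6}{2l+2l_0+1}+6\begin{Bmatrix} l_0 & l & l+l_0\\ l_0 & l & l+l_0\end{Bmatrix}\right\}\left(\frac{2l_0+1}{2l_0+3}\right)\left(\frac{2l+1}{2l+3}\right)\left(\frac{2l+2l_0+1}{2l+2l_0+3}\right)-1 \tag{30}$$

$$= \left(\frac{6l_0+9}{2l_0+3}+O(l^{-1})\right)\left(1+O(l^{-1})\right)-1$$

$$= 2+O(l^{-1}).$$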

The two pairs of processes $J_{1L;l_0,K}(r)$, $J_{2L;l_0,K}(r)$ and $J_{3L;l_0,K}(r)$, $J_{4L;l_0,K}(r)$ can be viewed as sorts of boundary cases for the possible configurations of multipoles. Indeed, although none of them has so far been considered in the literature (to the best of our knowledge), it seems natural to view $J_{1L;l_0,K}(r)$, $J_{2L;l_0,K}(r)$ as proposals very close to much of what has been done so far in CMB data analysis. More precisely, it has become very common to restrict attention to multipoles close to or on the "main diagonal" $l_1 = l_2 = l_3 = l$, under the unproved conjecture that the greatest part of the non-Gaussian signal should concentrate in that area. This same rationale motivates $J_{1L;l_0,K}(r)$, $J_{2L;l_0,K}(r)$; we shall show below, though, how such a choice can be very far from optimal in relevant cases. On the other hand, $J_{3L;l_0,K}(r)$, $J_{4L;l_0,K}(r)$ rely on a sort of opposite strategy, that is, for a fixed $l_0$ we aim at maximizing the distance among multipoles, albeit preserving the triangle conditions $l_i \le l_j + l_k$. There are several alternative procedures one may wish to consider, but those we mentioned lend themselves to a simple analysis, while highlighting some quite unexpected features of asymptotics for fixed-radius fields.




THEOREM 5.1. As $L \to \infty$, for any fixed integers $l_0 > 0$, $K \ge 0$,

$$J_{1L;l_0,K}(r),\ J_{2L;l_0,K}(r),\ J_{3L;l_0,K}(r),\ J_{4L;l_0,K}(r) \Rightarrow W(r), \qquad 0 \le r \le 1, \tag{31}$$

where $\Rightarrow$ denotes weak convergence in the Skorohod space $D[0,1]$ and $W(r)$ denotes standard Brownian motion.
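A practical consequence of a Brownian limit is that functionals such as the supremum over $[0,1]$ have known null distributions. By the reflection principle, a standard fact for Brownian motion, $P\{\sup_{0\le r\le 1}W(r)\le x\} = 2\Phi(x)-1$ for $x \ge 0$. The sketch below evaluates this distribution function and inverts it by bisection to obtain critical values; the function names `sup_bm_cdf` and `critical_value` and the bracketing interval are illustrative assumptions.

```python
import math

def sup_bm_cdf(x):
    """P(sup_{0<=r<=1} W(r) <= x) = 2*Phi(x) - 1 = erf(x / sqrt(2)) for x >= 0."""
    return math.erf(x / math.sqrt(2.0)) if x > 0 else 0.0

def critical_value(alpha, lo=0.0, hi=10.0, tol=1e-10):
    """Bisection for the x solving sup_bm_cdf(x) = 1 - alpha."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sup_bm_cdf(mid) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, critical_value(alpha))
```

A one-sided test based on the supremum of one of the processes would reject Gaussianity when the observed supremum exceeds `critical_value(alpha)`; whether a one-sided or two-sided functional is appropriate depends on the alternative one has in mind.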


PROOF. The proofs for the processes $J_{aL;l_0,K}(r)$, $a = 1,\ldots,4$, are very similar; we give the details only for the most difficult case, namely $J_{4L;l_0,K}(r)$. Here the proof is made harder by the complicated structure of dependence; note indeed that the set of random coefficients $\{a_{lm}: l = l_0,\ldots,l_0+K,\ m = -l,\ldots,l\}$ belongs to each summand in (29). Denote by $\Im_l$ the filtration generated by the triangular array $\{a_{l,-l},\ldots,a_{l,l}\}$, $l = 1,2,\ldots$, and define

$$X_{l;L} = \frac{1}{\sqrt{K+1}}\sum_{u=0}^{K}\frac{\hat I^2_{l_0+u,l,l+l_0+u}-1}{\sqrt{2L}}, \qquad l = l_0+K+1,\ l_0+K+2,\ldots,$$

that is,

$$J_{4L;l_0,K}(r) = \sum_{l=l_0+K+1}^{[Lr]-l_0-K}X_{l;L}.$$
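Now we note first that

$$E\{X_{l;L}\,|\,\Im_{l-m}\} = \frac{1}{\sqrt{K+1}}\sum_{u=0}^{K}\frac{E\{\hat I^2_{l_0+u,l,l+l_0+u}\,|\,\Im_{l-m}\}-1}{\sqrt{2L}} = 0, \qquad m \ge 1, \tag{32}$$

because for all $0 < l_1 \le l < l_2 < l_3$, writing $\hat I^2_{l_1l_2l_3} = \hat I_{l_1l_2l_3}\hat I^*_{l_1l_2l_3}$,

$$E\{\hat I^2_{l_1l_2l_3}\,|\,\Im_l\} = \sum_{m_1,m_2,m_3}\ \sum_{m_1',m_2',m_3'}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1' & m_2' & m_3'\end{pmatrix}E\left\{\frac{a_{l_1m_1}a^*_{l_1m_1'}\,a_{l_2m_2}a^*_{l_2m_2'}\,a_{l_3m_3}a^*_{l_3m_3'}}{\hat C_{l_1}\hat C_{l_2}\hat C_{l_3}}\,\Big|\,\Im_l\right\}$$

$$= \sum_{m_1,m_2,m_3}\ \sum_{m_1',m_2',m_3'}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1' & m_2' & m_3'\end{pmatrix}\frac{a_{l_1m_1}a^*_{l_1m_1'}}{\hat C_{l_1}}\,E\left\{\frac{a_{l_2m_2}a^*_{l_2m_2'}}{\hat C_{l_2}}\right\}E\left\{\frac{a_{l_3m_3}a^*_{l_3m_3'}}{\hat C_{l_3}}\right\}$$

$$= \sum_{m_1,m_1'}\ \sum_{m_2,m_3}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1 & m_2 & m_3\end{pmatrix}\begin{pmatrix} l_1 & l_2 & l_3\\ m_1' & m_2 & m_3\end{pmatrix}\frac{a_{l_1m_1}a^*_{l_1m_1'}}{\hat C_{l_1}}$$

$$= \frac{1}{2l_1+1}\sum_{m_1}\frac{|a_{l_1m_1}|^2}{\hat C_{l_1}} = 1.$$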


Equation (32) does not imply that the triangular array $\{X_{l;L}\}_{l=2,3,\ldots}$ obeys a martingale difference property, because the sequence $X_{l;L}$ is not adapted to the filtration $\Im_l$. However, (32) proves that the pair sequences $\{X_{l;L},\Im_l\}_{l=2,3,\ldots}$ do satisfy a mixingale property [7, 25], that is,

$$[E(E\{X_{l;L}\,|\,\Im_{l-m}\})^2]^{1/2} \le c_1m^{-\beta} \quad\text{for } m \ge 1, \tag{33}$$

$$[E(X_{l;L}-E\{X_{l;L}\,|\,\Im_{l+m}\})^2]^{1/2} \le c_2m^{-\beta} \quad\text{for } m \ge 1, \tag{34}$$

$$\text{for some } c_1, c_2, \beta > 0. \tag{35}$$

Actually the left-hand sides of (33) and (34) are identically zero for $m$ larger than $l_0+K$, so that, for suitable choices of the constants $c_1, c_2$, the bounds on the right-hand sides hold for an arbitrarily large $\beta$. Note that $\{X_{l;L}: l = 1,2,\ldots,L,\ L = 1,2,\ldots\}$ is a uniformly integrable set, because $\hat I^2_{l_1l_2l_3}$ has finite fourth-order moments which are uniformly bounded [Theorem 3.1 and (23)]. Also, it is readily seen that

$$\sup_{0\le r<s\le 1}\ \limsup_{L\to\infty}\ \frac{\sum_{l=[Lr]}^{[Ls]}EX^2_{l;L}}{s-r} < \infty, \qquad \lim_{L\to\infty}\ \max_{l=1,\ldots,L}EX^2_{l;L} = 0.$$

We have thus established conditions (2.2), (2.3) and (2.5) in [25] for the functional central limit theorem to hold. To complete the proof, we only need to show that

$$\lim_{L\to\infty}E\left|E\left\{\Big(\sum_{l=[Ls]}^{[Lt]}X_{l;L}\Big)^2\,\Big|\,\Im_{[Lr]}\right\}-(t-s)\right| = 0 \quad\text{for any } r < s < t.$$

For notational simplicity and without loss of generality, we consider the special case $K = 0$. We have

$$E\left\{\Big(\sum_{l=[Ls]}^{[Lt]}X_{l;L}\Big)^2\,\Big|\,\Im_{[Lr]}\right\} = \sum_{l=[Ls]}^{[Lt]}E\{X^2_{l;L}\,|\,\Im_{[Lr]}\}+2\sum_{l=[Ls]}^{[Lt]}\ \sum_{l'=l+1}^{[Lt]}E\{X_{l;L}X_{l';L}\,|\,\Im_{[Lr]}\}.$$





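Now for $l, l' > [Lr]$ we have

$$E\{X_{l;L}X_{l';L}\,|\,\Im_{[Lr]}\} = \frac{1}{2L}E\{(\hat I^2_{l_0,l,l+l_0}-1)(\hat I^2_{l_0,l',l'+l_0}-1)\,|\,\Im_{[Lr]}\}$$

$$= \frac{1}{2L}\left[E\{\hat I^2_{l_0,l,l+l_0}\hat I^2_{l_0,l',l'+l_0}\,|\,\Im_{[Lr]}\}-E\{\hat I^2_{l_0,l,l+l_0}\,|\,\Im_{[Lr]}\}-E\{\hat I^2_{l_0,l',l'+l_0}\,|\,\Im_{[Lr]}\}+1\right] \tag{36}$$

$$= \frac{1}{2L}\left[E\{\hat I^2_{l_0,l,l+l_0}\hat I^2_{l_0,l',l'+l_0}\,|\,\Im_{[Lr]}\}-1\right].$$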
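There are now two possible cases, namely $l+l_0 = l'$ and $l+l_0 \ne l'$; in the latter, writing each squared statistic as the product of $\hat I$ and its complex conjugate,

$$E\{\hat I^2_{l_0,l,l+l_0}\hat I^2_{l_0,l',l'+l_0}\,|\,\Im_{[Lr]}\}$$

$$= \sum_{m_{01},m_{02},m_{03},m_{04}}\ \sum_{m_{11},m_{12},m_{21},m_{22}}\begin{pmatrix} l_0 & l & l+l_0\\ m_{01} & m_{11} & m_{21}\end{pmatrix}\begin{pmatrix} l_0 & l & l+l_0\\ m_{02} & m_{11} & m_{21}\end{pmatrix}\begin{pmatrix} l_0 & l' & l'+l_0\\ m_{03} & m_{12} & m_{22}\end{pmatrix}\begin{pmatrix} l_0 & l' & l'+l_0\\ m_{04} & m_{12} & m_{22}\end{pmatrix}\frac{a_{l_0m_{01}}a^*_{l_0m_{02}}a_{l_0m_{03}}a^*_{l_0m_{04}}}{\hat C^2_{l_0}} \tag{37}$$

$$= \sum_{m_{01},m_{02},m_{03},m_{04}}\frac{\delta_{m_{01}}^{m_{02}}\delta_{m_{03}}^{m_{04}}}{(2l_0+1)^2}\,\frac{a_{l_0m_{01}}a^*_{l_0m_{02}}a_{l_0m_{03}}a^*_{l_0m_{04}}}{\hat C^2_{l_0}} = 1,$$

the second to last step following from (49) in the Appendix. Otherwise, for $l+l_0 = l'$,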


3 


“9 22 
Ej, taglio tr rap Bir) 


u Y Y ( lo l I’ ) 
i mo, mij mi 


ING] 77027103 104 1] 1 21 m?» 


«( lp l [l Y lg I’ pay 
mo», -—mjj —m21 mo3 mi, m», 


x ( lo l’ l + "i Glgmo; Clgmo? 4lomo3 4omo4 
= LG 5003 OTOA 


moy —mi2o ~mn Ch 


lo l I’ ) 
s 2 2 RÀ mii m21 


mo, F102 M7193 7104 M11 ,m5,m»; m») 


«( lg l H Y lg I’ - 
mo, mj» -—m?| mo3  —mii m» 


T ( lo 4 ded O[omo, lomoz lomoz C Tomo4 
moy =mi -m2 C? 


(38) 


—]-4-2B, 


cp 20826 




where 


Borders, Urt m ma) (mor ma mn) 
mo ,mo5,mqg3,mo4 m 1,7 ]2,m»j,m22 HiOL MIr- 721 mo, mij, -m21 


x ( lo I’ l’ en ( lo i i Me Glomo lomoz Clgmgs C lomoa 
mo; —mij; mn JXmo, -mọ -mz C 


=> X tofi Rib I’ 4- lo 1 


s=00=—s oo sly 


l l $ l l S 
AO PPM Oe 
moi Poo, moz, my 01 02 nma 04 


y lomo; lomo lomoz lomos 
Ch 

in view of (54) in the Appendix. Now the triangle conditions entail that the summands are nonzero only for $s \le 2l_0$; from (50) and (55) below we learn that


























Lot mo «ns m o S DR 
moi moo 0/||Xmos moe oJ Aht 
h l a li l' -+ lo 1 - I Š l 
lo s lbj ilb s b WL CADAT 
whereas it is also trivial to note that 
2 
Algmoi lomo | zmar <en latom]? = (2lg + 1) max — iom | eoque. 
Ci ie (OS us Eu lai 
Hence 
(2lo + 1) I 
BIS, AU UAE s 22. pa 5 OR. 
5x2lo |o | 2s Td omoimoz,mos,mos “10 T 
(39) 
p (4lo + 1) lo + 1)' 
~ 21+ 1 


Combining (36), (37), (38) and (39), we have 


1 (4lo + 1)(2lp +1)’ yl 
L Fi eo 
[Lt] [Lt] 


log L (Ho + 1)(2lo + 1)’ 
D Y EXX) sA T 


|E {Xi LXr LIS} x 


The remaining part of the proof is similar. More precisely, we have


E[X; Ll3} 
1 
= SEU. ias 7 Huh 
Eg 1 Lug Sta) 


= > 3 (P l pd 


n m m 
mo1,m2,moa mo m 1,m12,m21,m22 01 11 2] 


- ( lo l ied 
mo, ~My, m2] 
«Lo l peg | l di 
» mo3 712 nm?) moa -mi2 ~m 
lomo; lomo 4omo3 lomos 
x —— A 
C2 
lo 
‘et » Y ( l l pd 


m n m 
mj ,mo2,mo3,mo4 m jj ,m 12,231 ,m22 01 11 21 
- ( lo l ing 
mo, -—mjj m», 


ds l n l dii) 
moy mi») -—m»j my ~mi? -m2 


x lomo lomoz Žlomoz C 1gmoa 


72 
C ^ 
— 3A + 6B. 
As before, 
MS —mq e—m04 
Ômo, Omos lomo; lomoz Alomo3 lomos 
m Qui o GE 
moj,mg2,mo3,mo4. N70 lo 
= 5 1 |Aigmo, ato aa 
= See ee em 
and [see (54)] 


> E l pag | l jud 


m m m m —m m 
m 1,7142,™M21,m2 01 11 21 02 11 22 


(A l iid e l dine) 
mo; mi» —n) mo4 —mi2 —m» 




$ 2 
u L llo Ip 
EY Donl : t 


s5x2lg O=--S 
" lo lg s lo lo s 
mog mo cJjXmo mo oj’ 


_ Alo + 1) Glo 4 1)! 
ies 2l P 1 ; 
as in (39). We have thus shown that, for some C > 0, 


[Lt] 2 
EI > Xi) Bun! stre) 


l=[Ls] 


whence 





[Lt] [Lt] [Lf 








2 
« © EBan -D4 © D Eranti 
I=[Ls] i=[Ls]/=i+1 
c| El (Alg-- 1 | 
< > ——— | log Eo 








log L 
< C(4lo + 18 == > 0 as L — oo. 
Thus the proof is complete. $\Box$


Theorem 5.1 can be immediately applied to derive the asymptotic distribution under the null of several non-Gaussianity tests. For instance, we might focus on $\sup_{0\le r\le 1}J_{aL;l_0,K}(r)$; by the continuous mapping theorem we obtain

$$\lim_{L\to\infty}P\left\{\sup_{0\le r\le 1}J_{aL;l_0,K}(r)\le x\right\} = P\left\{\sup_{0\le r\le 1}W(r)\le x\right\} = 2\Phi(x)-1, \qquad x \ge 0,\ a = 1,2,3,4, \tag{40}$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of a standard Gaussian variable (for the last equality, see, e.g., [5]). We shall now discuss the behavior of these procedures under some examples of non-Gaussian spherical fields. The behavior of higher-order angular power spectra under non-Gaussian alternatives is an extremely important research topic in modern cosmology, and still almost completely open for mathematical research. Very few analytic results are available, whereas the cosmological debate is still open on the nature of the non-Gaussianity to be expected. A simple and popular model for non-Gaussian temperature fluctuations reads


$$T_{NG}(\theta,\varphi) = T(\theta,\varphi)+f_{NL}\big(T^2(\theta,\varphi)-ET^2(\theta,\varphi)\big); \tag{41}$$

as before, we take $T(\theta,\varphi)$ to be an isotropic Gaussian field with zero mean. For "small" values of the nonlinearity parameter $f_{NL}$, (41) can be viewed as a general approximation for random fields with minor departures from Gaussianity: the quadratic term can be regarded as the leading factor in a Taylor expansion of a general field $g(T(\theta,\varphi))$, for a suitably regular function $g(\cdot)$. Equivalently, the terms on the right-hand side can be considered to be the first two elements of the expansion of $g(T(\theta,\varphi))$ into a series of orthogonal Hermite polynomials $H_q(T)$, $q = 1,2,\ldots$. For these reasons, (41) is very widely adopted to represent the primordial field of temperature fluctuations in cosmological models of inflation, which stand now as the leading models for the dynamics in the primordial epochs around the Big Bang [3, 19, 28]. It is known in the physics literature [19] that the bispectrum of (41) can be approximated by


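(42)
$$B_{l_1 l_2 l_3} = G f_{NL}\, h_{l_1 l_2 l_3} \begin{pmatrix} l_1 & l_2 & l_3 \\ 0 & 0 & 0 \end{pmatrix} \{C_{l_1}C_{l_2} + C_{l_1}C_{l_3} + C_{l_2}C_{l_3}\},$$

where G is a positive constant,

$$h_{l_1 l_2 l_3} = \Big(\frac{(2l_1+1)(2l_2+1)(2l_3+1)}{4\pi}\Big)^{1/2},$$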


and lower-order terms are neglected. Expression (42) is known as the Sachs-Wolfe 
bispectrum; we shall take (42) as a benchmark model for our discussion of non- 
Gaussianity. We introduce a very mild regularity condition on the angular power 
spectrum, that is, we shall assume that $C_l$ is such that, for fixed $l_0 > 0$,

(43) $$C_{l+l_0} \asymp C_l,$$

where $\asymp$ denotes that the ratio of the right- and left-hand sides tends to a positive
constant, as $l \to \infty$. The assumption described by (43) is not unreasonable: indeed,
$C_l \asymp l^{-\alpha}$ (for some positive constant $\alpha$) for many, if not most, cosmological
models.

For simplicity, let us assume that the normalizing angular power spectrum is 
nonrandom, that is, known a priori; without loss of generality we take K = 0. 
Recall that ([34], equations 8.1.2.12 and 8.5.2.32) 


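$$\begin{pmatrix} l_1 & l_2 & l_3 \\ 0 & 0 & 0 \end{pmatrix} = \frac{(-1)^{(l_1+l_2+l_3)/2}\,[(l_1+l_2+l_3)/2]!}{[(l_1+l_2-l_3)/2]!\,[(l_1-l_2+l_3)/2]!\,[(-l_1+l_2+l_3)/2]!} \Big\{\frac{(l_1+l_2-l_3)!\,(l_1-l_2+l_3)!\,(-l_1+l_2+l_3)!}{(l_1+l_2+l_3+1)!}\Big\}^{1/2}.$$
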
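Thus, for fixed $l_0 > 2$,

$$\begin{pmatrix} l_0 & l & l+l_0 \\ 0 & 0 & 0 \end{pmatrix} = (-1)^{l+l_0}\,\frac{(l+l_0)!}{l_0!\,l!}\Big\{\frac{(2l_0)!\,(2l)!}{(2l+2l_0+1)!}\Big\}^{1/2} = (-1)^{l+l_0}\,\frac{C}{\sqrt{l}} + o(l^{-1/2}),$$

for some $C > 0$ which depends on $l_0$ but not on $l$. Then we have easily that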


[Lr] 
(21g + 1) 21 + 1) 21 + 21g + 1) 
EBL) « * 2 (a MEC 
L 41 "m 
(44) z 4 Coe! x [Cig Craig 4. [Citt 
VLIV Cru Ci Ch 











[Lr] CC [Lr] 
Y [vi lo x >a VIX NLL. 
<J l=Io+1 Cio «Tt 1-141 
Likewise 
1 [Lr] T j [Er] 
PREIA T 2 Ulis = Tp o onto) =l 
I=lọ+1 L l=1 
(45) 
fs y ] fs 72/2 
x — xX 
VL i 


Of course, both (44) and (45) diverge as the number of observed multipoles
increases ($L \to \infty$), that is, as the resolution of the experiment improves. The
constants of proportionality are typically small; for the model we adopt in the
simulations below, they are of the order $10^{-4}$ for (44) and $10^{-8}$ for (45).

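On the other hand, for $l_1 = l_2 = l_3 = l$,

$$\begin{pmatrix} l & l & l \\ 0 & 0 & 0 \end{pmatrix} = (-1)^{3l/2}\,\frac{(3l/2)!}{[(l/2)!]^3}\Big\{\frac{[l!]^3}{(3l+1)!}\Big\}^{1/2} = (-1)^{3l/2}\sqrt{\frac{2}{\pi\sqrt{3}}}\;\frac{1}{l}\,\{1 + O(l^{-1})\},$$

where we have used Stirling's formula $n!/(\sqrt{2\pi}\,n^{n+1/2}e^{-n}) = 1 + O(n^{-1})$.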
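The two decay rates used in this argument can be checked numerically. The sketch below evaluates the closed form for the coefficients with $m_1 = m_2 = m_3 = 0$ through log-factorials and compares $\sqrt{l}\,|(l_0\ l\ l+l_0;\,0\,0\,0)|$ and $l\,|(l\ l\ l;\,0\,0\,0)|$ with their limits. The function names are illustrative, and the limiting constants $\sqrt{(2l_0)!}/(l_0!\,2^{l_0+1/2})$ and $\sqrt{2/(\pi\sqrt{3})}$ are obtained here by applying Stirling's formula to the closed form; they are a working assumption of this sketch rather than values quoted from [34].

```python
from math import lgamma, exp, sqrt, pi, factorial


def log_fact(n):
    """log(n!) via the log-gamma function, safe for large n."""
    return lgamma(n + 1.0)


def wigner_3j_zero_m(l1, l2, l3):
    """Closed form for the 3j coefficient (l1 l2 l3; 0 0 0)."""
    total = l1 + l2 + l3
    # The coefficient vanishes for odd l1+l2+l3 or a violated triangle rule.
    if total % 2 == 1 or l3 > l1 + l2 or l3 < abs(l1 - l2):
        return 0.0
    g = total // 2
    log_val = (log_fact(g) - log_fact(g - l1) - log_fact(g - l2) - log_fact(g - l3)
               + 0.5 * (log_fact(2 * g - 2 * l1) + log_fact(2 * g - 2 * l2)
                        + log_fact(2 * g - 2 * l3) - log_fact(2 * g + 1)))
    sign = -1.0 if g % 2 else 1.0
    return sign * exp(log_val)


def off_diagonal_constant(l0):
    """Assumed limit of sqrt(l) * |(l0 l l+l0; 0 0 0)| as l grows."""
    return sqrt(factorial(2 * l0)) / (factorial(l0) * 2.0 ** (l0 + 0.5))


DIAGONAL_CONSTANT = sqrt(2.0 / (pi * sqrt(3.0)))  # assumed limit of l * |(l l l; 0 0 0)|

if __name__ == "__main__":
    for l in (10, 100, 1000):
        print(l,
              sqrt(l) * abs(wigner_3j_zero_m(2, l, l + 2)),
              l * abs(wigner_3j_zero_m(l, l, l)))
```

The off-diagonal configuration decays like $l^{-1/2}$ while the main-diagonal one decays like $l^{-1}$, which is the difference that drives the contrast between (44)-(45) and (46).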
Now recall that

$$ET^2(\theta,\varphi) = \sum_{l=0}^{\infty}\frac{2l+1}{4\pi}\,C_l < \infty,$$

which implies $lC_l = o(l^{-1})$ (in realistic experiments, a weighting factor $G_l$ should
be included to account for the beam pattern of the antenna). Hence we have


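(46)
$$EJ_{1L;l_0,K}(r) = O\Big(\frac{1}{\sqrt{L}}\sum_{l=1}^{[Lr]} l^{3/2}\,\Big|\begin{pmatrix} l & l & l \\ 0 & 0 & 0 \end{pmatrix}\Big|\sqrt{C_l}\Big) = O\Big(\frac{1}{\sqrt{L}}\sum_{l=1}^{[Lr]}\sqrt{lC_l}\Big) = o(1)\qquad\text{as } L\to\infty.$$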


A similar heuristic argument can be used for $J_{2L;l_0,K}(r)$, suggesting that these
statistics may have very little, if any, power against alternatives of the form (42).

To validate the previous heuristics, we present a small Monte Carlo study on
the power of these testing procedures. To this aim, we generated 200 spherical
Gaussian fields according to the currently favored scenario for CMB fluctuations,
the so-called ΛCDM model; we omit the details, which can be found, for instance,
in [6]. Nonlinearities were then introduced by means of (41), which represents the
simplest non-Gaussian model, as argued earlier. To ease comparisons with existing
procedures, we follow the standard parametrization used in the astrophysical
literature, yielding a variance for T of the order of $\operatorname{Var}(T) = ET^2 \simeq 10^{-9}$. We can then
present a rough relationship between the value of the nonlinearity parameter $f_{NL}$
and the relative amount of the non-Gaussian signal, namely


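(47) $$\sqrt{\frac{\operatorname{Var}\{f_{NL}T^2\}}{\operatorname{Var}\{T\}}} = \sqrt{2}\,f_{NL}\sqrt{\operatorname{Var}\{T\}} \simeq f_{NL}\times 10^{-4}.$$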
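The arithmetic behind (47) can be sketched as follows. For a zero-mean Gaussian $T$ with standard deviation $\sigma$ we have $\operatorname{Var}(T^2) = 2\sigma^4$, so the ratio of the standard deviation of the quadratic term to that of $T$ is $\sqrt{2}\,f_{NL}\,\sigma$. The code below computes this ratio analytically and compares it with a seeded simulation. The value `SIGMA`, chosen so that $\sqrt{2}\,\sigma = 10^{-4}$, and the function names are assumptions made for illustration; they only match the order of magnitude quoted in the text.

```python
import random
from math import sqrt

# Assumed standard deviation of T, chosen so that sqrt(2) * SIGMA = 1e-4.
SIGMA = 1e-4 / sqrt(2.0)


def non_gaussian_fraction(f_nl, sigma=SIGMA):
    """Analytic ratio sd(f_nl * T^2) / sd(T) for Gaussian T with sd sigma."""
    return sqrt(2.0) * f_nl * sigma


def simulated_fraction(f_nl, sigma=SIGMA, n=200_000, seed=0):
    """Monte Carlo estimate of the same ratio from n Gaussian draws."""
    rng = random.Random(seed)
    t = [rng.gauss(0.0, sigma) for _ in range(n)]
    quad = [f_nl * (x * x - sigma * sigma) for x in t]

    def sd(values):
        mean = sum(values) / len(values)
        return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

    return sd(quad) / sd(t)


if __name__ == "__main__":
    for f_nl in (100, 300, 1000):
        print(f_nl, non_gaussian_fraction(f_nl), simulated_fraction(f_nl))
```

Under this assumed normalization, $f_{NL} = 100$, 300 and 1000 correspond to a non-Gaussian component of roughly 1%, 3% and 10% of the Gaussian one.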


We consider $f_{NL} = 0, 100, 300, 1000$ and we focus on the statistics
$\sup_{0\le r\le 1} J_{3L;l_0,K}(r)$, $\sup_{0\le r\le 1} J_{4L;l_0,K}(r)$ for $l_0 = 2$ and $K = 0, 2, 4$; see (40).
We omit reporting results for $\sup_{0\le r\le 1} J_{1L;l_0,K}(r)$, $\sup_{0\le r\le 1} J_{2L;l_0,K}(r)$ because
the power related to these procedures turned out to be negligible, as expected
from (46). Other types of statistics may be considered without altering the main
conclusions we are going to draw here. We start from the empirical sizes (type I
errors), which are reported in Table 1; we write $S_{a,L}$ for $\sup_{0\le r\le 1} J_{aL;2,K}(r)$,
$a = 3, 4$. We take $L = 250, 500$; these values are conservative: we recall that L
is reckoned to be of the order of 600/800 for WMAP and 2000/2500 for Planck.
Note that all values in Table 1 and those in Tables 3–5 are expressed as
percentages (%).

Results in Table 1 suggest that the asymptotic theory presented in Theorem 5.1 
provides a good approximation for the finite-sample behavior in the case of $J_{3L;l_0,K}$;
the approximation is slightly less satisfactory for $J_{4L;l_0,K}$, but the results improve
markedly with L, which is reassuring. Because the test statistics are free of nui- 
sance parameters, it is also possible to derive directly threshold values under the 
null of Gaussianity by Monte Carlo replications. We adopt both approaches to de- 
rive the power properties reported in Tables 3—5; the Monte Carlo critical values 
are reported in Table 2. 


TABLE 1
Empirical sizes ($f_{NL} = 0$), $\alpha$ = 10% (5%)

              K = 0         K = 2         K = 4
S_{3,250}     9.5 (4.5)     9.0 (4.0)     8.0 (2.5)
S_{3,500}    10.5 (4.5)     9.5 (4.5)     9.5 (4.0)
S_{4,250}     5.5 (3.5)     3.0 (1.5)     1.0 (0.5)
S_{4,500}     6.5 (4.5)     8.0 (4.0)     6.0 (2.0)


TABLE 2
Monte Carlo critical values, $\alpha$ = 10% (5%)

              K = 0          K = 2          K = 4
S_{3,250}     1.61 (1.81)    1.55 (1.83)    1.61 (1.72)
S_{3,500}     1.69 (1.90)    1.63 (1.92)    1.61 (1.80)
S_{4,250}     1.44 (1.71)    1.27 (1.51)    1.06 (1.35)
S_{4,500}     1.56 (1.88)    1.52 (1.85)    1.38 (1.67)


TABLE 3
Rejection rates for $f_{NL} = 100$, $\alpha$ = 10% (5%)

              K = 0 (T)      K = 2 (T)      K = 4 (T)      K = 0 (A)      K = 2 (A)      K = 4 (A)
S_{3,250}     18.0 (10.5)    20.0 (12.0)    18.5 (17.5)    17.5 (7.0)     16.5 (9.5)     17.5 (11.0)
S_{3,500}     20.5 (13.5)    30.5 (20.0)    39.5 (33.0)    20.5 (12.5)    30.5 (20.0)    38.5 (26.0)
S_{4,250}      9.0 (5.0)     10.5 (4.5)     10.0 (4.5)      5.5 (3.0)      3.0 (1.5)      1.0 (1.0)
S_{4,500}     10.0 (5.0)     10.0 (6.0)     10.0 (6.0)      7.0 (4.0)      7.0 (4.0)      6.0 (2.0)


TABLE 4
Rejection rates for $f_{NL} = 300$, $\alpha$ = 10% (5%)

              K = 0 (T)      K = 2 (T)      K = 4 (T)      K = 0 (A)      K = 2 (A)      K = 4 (A)
S_{3,250}     31.0 (26.0)    47.5 (38.5)    56.5 (51.5)    31.0 (21.5)    43.0 (33.0)    56.0 (41.5)
S_{3,500}     47.5 (43.0)    88.5 (69.0)    88.5 (86.5)    48.5 (40)      88.5 (68.5)    88.5 (79.0)
S_{4,250}     10.5 (5.0)     11.5 (5.0)      9.0 (5.5)      6.0 (3.0)      3.5 (1.5)      2.0 (0.5)
S_{4,500}     12.0 (6.0)     11.5 (7.5)     11.5 (7.0)      9.0 (4.0)      9.5 (6.0)      7.5 (2.5)


TABLE 5
Rejection rates for $f_{NL} = 1000$, $\alpha$ = 10% (5%)

              K = 0 (T)      K = 2 (T)      K = 4 (T)      K = 0 (A)      K = 2 (A)      K = 4 (A)
S_{3,250}     80.0 (73.0)    99.5 (99.5)    100 (100)      80.0 (70.5)    99.5 (99.5)    100 (99.5)
S_{3,500}     98.0 (98.0)    100 (100)      100 (100)      98.0 (97.5)    100 (100)      100 (100)
S_{4,250}     15.5 (10.5)    26.0 (18.5)    32.5 (18.5)    12.0 (7.5)     10.5 (5.5)      8.5 (3.5)
S_{4,500}     37.5 (29.0)    57.5 (50.5)    71.5 (58.5)    34.0 (24.0)    54.0 (47.0)    60.5 (47.5)


Note how the tabulated values approach the asymptotic results [i.e., 1.645, 1.96;
see (40)] when L increases. To analyze the power of the test, we consider $f_{NL}$ =
100, 300, 1000; from (47) we can argue heuristically that these values correspond
approximately to 1%, 3% and 10% of non-Gaussianity in the maps, respectively.
In Tables 3–5, T and A denote the power with respect to tabulated (Monte Carlo, Table 2)
and asymptotic critical values, respectively.
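The asymptotic critical values 1.645 and 1.96 follow from inverting the limiting law $2\Phi(x) - 1$ in (40). The sketch below does this with the standard library and also reports the binomial standard error of a rejection rate estimated from 200 replications, which gives a rough sense of the sampling noise in Tables 1 and 3–5. The function names are illustrative, and treating each tabulated rate as a simple binomial proportion is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian, mean 0 and sd 1


def sup_bm_cdf(x):
    """P(sup_{0<=r<=1} W(r) <= x) = 2*Phi(x) - 1 for x > 0, and 0 otherwise."""
    return 2.0 * STD_NORMAL.cdf(x) - 1.0 if x > 0 else 0.0


def critical_value(alpha):
    """Solve 2*Phi(x) - 1 = 1 - alpha, i.e. x = Phi^{-1}(1 - alpha/2)."""
    return STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)


def mc_standard_error(rate, replications=200):
    """Binomial standard error of an estimated rejection rate in [0, 1]."""
    return sqrt(rate * (1.0 - rate) / replications)


if __name__ == "__main__":
    for alpha in (0.10, 0.05):
        x = critical_value(alpha)
        print(alpha, round(x, 3), round(sup_bm_cdf(x), 3))
    print(round(100 * mc_standard_error(0.10), 1), "percentage points")
```

With 200 replications a nominal 10% size is estimated with a standard error of about 2 percentage points, so deviations of that order in Table 1 are compatible with sampling noise.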

At first sight, the results reported seem quite encouraging, as compared, for in- 
stance, to existing methods such as the empirical process, wavelets and local cur- 
vature: see, for instance, [6] for numerical simulations on the performance of these 
procedures. We stress once again, however, that this comparison could be to some 
extent misleading, insofar as in this paper we are sticking to some simplifying as- 
sumptions that are unrealistic for CMB experiments (namely, the absence of gaps 
and noise in the observed maps). The results we report, however, certainly suggest 
that further investigation of these procedures under more realistic circumstances is 
worth pursuing. We also note that the statistics based on $J_{3L;l_0,K}(r)$ substantially
outperform those based on $J_{4L;l_0,K}(r)$ for this range of parameter values; indeed,
for the latter the power is nonnegligible only when $f_{NL} = 1000$, that is, when the
level of non-Gaussianity in the map is approximately 10%. However, a comparison
of (44) and (45) suggests that the power for $J_{4L;l_0,K}(r)$ may improve more
rapidly as the resolution of the experiments grows; this is an important factor to
keep in mind, as it is expected that the satellite Planck will achieve observations
with L of the order of 2500. Also, because $J_{4L;l_0,K}(r)$ does not depend on the sign
of the sample bispectrum, it can be expected to be more robust against other types 
of non-Gaussian behavior. It is also worth noting that the power of the tests grows 
with the pooling parameter K; for values larger than 6—8, however, the effect is 
nearly negligible and the computational cost becomes prohibitive. 

In our opinion, there are two remarkable features that emerge from this sec- 
tion. Equations (44) and (45) suggest that a testing procedure in harmonic space 
can yield consistent tests of Gaussianity even for fixed-radius, nonergodic fields, 
at least under the simplifying assumption of this paper. This is to some extent an 
unexpected result. The second remarkable fact is the huge impact of the choice of 
combined angular scales on the expected power under non-Gaussian alternatives. 
It is noteworthy that the common choice of a (close to) “main diagonal” configu- 
ration can yield negligible power, the expected value of the non-Gaussian signal 
decreasing to zero as the resolution of the experiment improves. The determination 
of the triples of angular scales $(l_1, l_2, l_3)$ where the largest part of the non-Gaussian
signal is to be expected, for a given class of models, represents an issue of great 
importance for future cosmological data analysis. 


6. Comments and conclusion. It is important to make clear that the asymp- 
totic theory presented in this paper is of a rather different nature with respect to 
what is usually undertaken for random processes or fields. More precisely, we are
not assuming that the information grows in the sense that a larger interval or region
of observations becomes available, but rather we assume that the same region
(spherical surface, in our case) is observed with greater and greater resolution. We 
labeled this framework high-resolution asymptotics, whereas the more standard 
case where the observed volume grows can be termed, as usual, large-sample as- 
ymptotics. The idea that some consistent inference can be drawn from a process 
observed on a fixed finite region is certainly not new; see, for instance, [32] for 
a fixed-domain asymptotics approach to study optimal linear prediction of spatial 
processes (kriging). The term infill asymptotics is also used in the statistical litera- 
ture with the same meaning. We believe that this paradigm will become more and 
more fruitful in the years to come, with many possible contexts of applications. 
In the case of cosmological research, a proper understanding of the nature of the 
asymptotic theory involved is likely to be quite relevant from both the theoretical 
and the practical point of view. In fact, note that sequential experiments to measure 
CMB radiation result in exactly the same last scattering surface being measured, 
while the resolution improves steadily over time. For instance, the above men- 
tioned NASA satellite WMAP, launched in 2001, is observing the same surface 
as the ESA mission Planck, due to be launched in 2007: in terms of the standard 
large-sample asymptotics, no improvement should be expected. On the other hand, 
for statistical properties that can be consistently investigated as the resolution of 
the experiment grows, Planck does offer substantial new information, its expected 
resolution outperforming WMAP by a factor 3 or 4 (in any case, Planck will of- 
fer other improvements besides better angular resolution, e.g., better polarization 
measurements). It seems therefore quite relevant to suggest, as we did in this paper, 
that Gaussianity tests may exist that are high-resolution consistent, at least under 
some simplifying assumptions. The fact that consistent inferences can be drawn 
from nonergodic random fields defined on a bounded domain has other important 
consequences if we focus on epistemological issues. The status of cosmology as 
a science is occasionally questioned, on the grounds that it is, in a sense, a disci- 
pline based by definition on a single observation (our Universe). The possibility 
to draw consistent inferences for fixed-radius random fields provides, in our view, 
a strong argument to consider the corresponding physical properties fully within 
the domain of scientific investigation. It is a challenging task to characterize, un- 
der general conditions, the complete set of properties on which high-resolution 
consistent inferences can be drawn. 


APPENDIX 


A.1. The properties of Wigner’s coefficients. In this paper, we make ex- 
tensive use of Wigner's 3j coefficients, which are a very powerful tool to repre- 
sent properties of random fields which are invariant to rotations. These coefficients 
were introduced in the framework of the quantum theory of angular momenta; they 
are also widely used in algebra in the framework of representation theory. In this 
section we shall recall some of their properties for convenience. 




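Wigner's coefficients are defined implicitly by

$$\int_0^{\pi}\!\int_0^{2\pi} Y_{l_1m_1}(\theta,\varphi)\,Y_{l_2m_2}(\theta,\varphi)\,Y_{l_3m_3}(\theta,\varphi)\,\sin\theta\,d\varphi\,d\theta = \Big(\frac{(2l_1+1)(2l_2+1)(2l_3+1)}{4\pi}\Big)^{1/2}\begin{pmatrix} l_1 & l_2 & l_3 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} l_1 & l_2 & l_3 \\ m_1 & m_2 & m_3 \end{pmatrix}.$$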


Many explicit representations are also available, but for general values of $l_i$, $m_i$
they are lengthy and hardly informative. For instance, it can be shown that ([34],
expression 8.2.1.5)

$$\begin{pmatrix} l_1 & l_2 & l_3 \\ m_1 & m_2 & m_3 \end{pmatrix} = (-1)^{l_1+m_1}\Big[\frac{(l_1+l_2-l_3)!\,(l_1-l_2+l_3)!\,(-l_1+l_2+l_3)!}{(l_1+l_2+l_3+1)!}\Big]^{1/2}\Big[\frac{(l_3+m_3)!\,(l_3-m_3)!}{(l_1+m_1)!\,(l_1-m_1)!\,(l_2+m_2)!\,(l_2-m_2)!}\Big]^{1/2}$$
$$\times \sum_z \frac{(-1)^z\,(l_2+l_3+m_1-z)!\,(l_1-m_1+z)!}{z!\,(l_2+l_3-l_1-z)!\,(l_3-m_3-z)!\,(l_1-l_2+m_3+z)!},$$

where the summation runs over all z's such that the arguments of the factorials are nonnegative.
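An explicit sum of this kind is straightforward to evaluate for small indices. The sketch below implements one such representation with exact integer factorials and checks it against a few values that are easy to obtain by hand, together with the orthonormality and parity properties of the coefficients. The function name is illustrative; the placement of the signs of $m_3$ inside the sum is the one that reproduces these reference values, and it should be treated as an assumption of this sketch rather than a transcription of [34].

```python
from math import factorial, sqrt


def wigner_3j(l1, l2, l3, m1, m2, m3):
    """3j coefficient for integer arguments via an explicit finite sum."""
    # Selection rules: projections sum to zero, triangle rule, |m_i| <= l_i.
    if m1 + m2 + m3 != 0:
        return 0.0
    if l3 < abs(l1 - l2) or l3 > l1 + l2:
        return 0.0
    if abs(m1) > l1 or abs(m2) > l2 or abs(m3) > l3:
        return 0.0
    triangle = (factorial(l1 + l2 - l3) * factorial(l1 - l2 + l3)
                * factorial(-l1 + l2 + l3) / factorial(l1 + l2 + l3 + 1))
    ratio = (factorial(l3 + m3) * factorial(l3 - m3)
             / (factorial(l1 + m1) * factorial(l1 - m1)
                * factorial(l2 + m2) * factorial(l2 - m2)))
    # Keep only the z for which every factorial argument is nonnegative.
    z_min = max(0, l2 - l1 - m3)
    z_max = min(l2 + l3 - l1, l3 - m3)
    total = 0.0
    for z in range(z_min, z_max + 1):
        num = factorial(l2 + l3 + m1 - z) * factorial(l1 - m1 + z)
        den = (factorial(z) * factorial(l2 + l3 - l1 - z)
               * factorial(l3 - m3 - z) * factorial(l1 - l2 + m3 + z))
        total += (-1) ** z * num / den
    return (-1) ** (l1 + m1) * sqrt(triangle * ratio) * total


def sum_of_squares(l1, l2, l3):
    """Sum of squared coefficients over all m1, m2 (m3 is then determined)."""
    return sum(wigner_3j(l1, l2, l3, m1, m2, -m1 - m2) ** 2
               for m1 in range(-l1, l1 + 1)
               for m2 in range(-l2, l2 + 1))


if __name__ == "__main__":
    print(wigner_3j(1, 1, 2, 0, 0, 0), sqrt(2.0 / 15.0))
    print(sum_of_squares(2, 3, 4))
```

Exact integer factorials keep the sketch simple but limit it to moderate indices; for large multipoles a recursive or log-factorial implementation would be the safer choice.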


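We list here some important properties, and refer to [34] for proofs and further
discussion:

(a) Wigner's 3j coefficients are real valued;
(b) they are different from zero only if $m_1 + m_2 + m_3 = 0$;
(c) (parity) for any triple $l_1, l_2, l_3$

$$\begin{pmatrix} l_1 & l_2 & l_3 \\ m_1 & m_2 & m_3 \end{pmatrix} = (-1)^{l_1+l_2+l_3}\begin{pmatrix} l_1 & l_2 & l_3 \\ -m_1 & -m_2 & -m_3 \end{pmatrix};$$

(d) (symmetry) for any triple $l_1, l_2, l_3$

$$\begin{pmatrix} l_1 & l_2 & l_3 \\ m_1 & m_2 & m_3 \end{pmatrix} = \begin{pmatrix} l_2 & l_3 & l_1 \\ m_2 & m_3 & m_1 \end{pmatrix} = \begin{pmatrix} l_3 & l_1 & l_2 \\ m_3 & m_1 & m_2 \end{pmatrix}$$
$$= (-1)^{l_1+l_2+l_3}\begin{pmatrix} l_3 & l_2 & l_1 \\ m_3 & m_2 & m_1 \end{pmatrix} = (-1)^{l_1+l_2+l_3}\begin{pmatrix} l_1 & l_3 & l_2 \\ m_1 & m_3 & m_2 \end{pmatrix} = (-1)^{l_1+l_2+l_3}\begin{pmatrix} l_2 & l_1 & l_3 \\ m_2 & m_1 & m_3 \end{pmatrix};$$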

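(e) (orthonormality) for any triple $l_1, l_2, l_3$

(48) $$\sum_{m_1=-l_1}^{l_1}\ \sum_{m_2=-l_2}^{l_2}\ \sum_{m_3=-l_3}^{l_3}\begin{pmatrix} l_1 & l_2 & l_3 \\ m_1 & m_2 & m_3 \end{pmatrix}^2 = 1$$

and

(49) $$\sum_{m_1=-l_1}^{l_1}\ \sum_{m_2=-l_2}^{l_2}\begin{pmatrix} l_1 & l_2 & L \\ m_1 & m_2 & M \end{pmatrix}\begin{pmatrix} l_1 & l_2 & L' \\ m_1 & m_2 & M' \end{pmatrix} = \frac{\delta_L^{L'}\,\delta_M^{M'}}{2L+1};$$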





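(f) (upper bound) for any $l_1, l_2, l_3$

(50) $$\Big|\begin{pmatrix} l_1 & l_2 & l_3 \\ m_1 & m_2 & m_3 \end{pmatrix}\Big| = O\big([\max\{l_1, l_2, l_3\}]^{-1/2}\big);$$

(g1) (sum of coefficients, I) for any positive integers a, b

(51) $$\sum_{\alpha}(-1)^{a-\alpha}\begin{pmatrix} a & a & b \\ \alpha & -\alpha & 0 \end{pmatrix} = \sqrt{2a+1}\;\delta_b^0;$$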


(g2) (sums of coefficients, II) for any positive integers a, b, c, d, e and f 


p pegstens d db d cu 


a, b, y £,6,0 


(52) «(i d An b 4 
a 6 —@/\y Pp ¢ 
Ja b e|. 
"aj 
(g3) (sums of coefficients, III) for any positive integers a, b, c, d, e and f (see 
[34], 8.7.3.12) 


P aad M din : 2 : 5) 


ute | b ra 


(g4) (sums of coefficients, IV) for any positive integers a, b, c, d, e and f (see 
[34], 8.7.4.20) 
EU b NG f AT b dim T : 
gv QU" BF NITRO A EP Pe Sp y 
(54) = (= ero tes 


as j g s d b c a||b e g 
x o 3d o 5)U s Ale s Al 
Equation (52) can be used as the definition of Wigner’s 6j coefficient, which ap- 
pears on the right-hand side (with curly brackets). We refer again to ([34], Chap- 
ter 9) for some (extremely complicated) explicit expression for these coefficients 


and many of their properties. For our purposes it is sufficient to recall the follow- 
ing: 


(53) 


(h) for any positive integers a, b, c, d, e and f 


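(55) $$\Big|\begin{Bmatrix} a & b & c \\ d & e & f \end{Bmatrix}\Big| \le \min\Big(\frac{1}{\sqrt{(2c+1)(2f+1)}},\ \frac{1}{\sqrt{(2a+1)(2d+1)}},\ \frac{1}{\sqrt{(2b+1)(2e+1)}}\Big);$$

(i) the 6j coefficient is invariant under any permutation of its columns.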
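Properties (h) and (i) can be checked numerically for small arguments. The sketch below evaluates the 6j coefficient through the classical single-sum (Racah) formula, which is a standard representation but is not the one written out in this appendix, and then verifies the upper bound and the invariance under a column swap. The function names are illustrative.

```python
from math import factorial, sqrt


def triangle_coeff(a, b, c):
    """Triangle coefficient; 0.0 if (a, b, c) violates the triangle rule."""
    if c < abs(a - b) or c > a + b:
        return 0.0
    return sqrt(factorial(a + b - c) * factorial(a - b + c)
                * factorial(-a + b + c) / factorial(a + b + c + 1))


def wigner_6j(a, b, c, d, e, f):
    """6j coefficient {a b c; d e f} for nonnegative integers (Racah sum)."""
    pref = (triangle_coeff(a, b, c) * triangle_coeff(a, e, f)
            * triangle_coeff(d, b, f) * triangle_coeff(d, e, c))
    if pref == 0.0:
        return 0.0
    t_min = max(a + b + c, a + e + f, d + b + f, d + e + c)
    t_max = min(a + b + d + e, b + c + e + f, c + a + f + d)
    total = 0.0
    for t in range(t_min, t_max + 1):
        den = (factorial(t - a - b - c) * factorial(t - a - e - f)
               * factorial(t - d - b - f) * factorial(t - d - e - c)
               * factorial(a + b + d + e - t) * factorial(b + c + e + f - t)
               * factorial(c + a + f + d - t))
        total += (-1) ** t * factorial(t + 1) / den
    return pref * total


def upper_bound(a, b, c, d, e, f):
    """Right-hand side of the bound in property (h)."""
    return min(1.0 / sqrt((2 * c + 1) * (2 * f + 1)),
               1.0 / sqrt((2 * a + 1) * (2 * d + 1)),
               1.0 / sqrt((2 * b + 1) * (2 * e + 1)))


if __name__ == "__main__":
    print(wigner_6j(1, 1, 1, 1, 1, 1))   # compare with 1/6
    print(upper_bound(1, 1, 1, 1, 1, 1))  # 1/3
```

The bound in (h) is what converts the sums over intermediate indices in (57) and (58) into explicit powers of the multipoles.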
A.2. Proofs of technical lemmas. 


PROOF OF PROPOSITION 3.1. Let $\{I_1, \ldots, I_p\}$ be a partition of $I$ into $2 \times 3$
matrices made up with two of its rows; write $\mathcal{L}$ for the class of these partitions,
which has cardinality $(2p-1)!!$ (the number of combinations by which we can
match $2p$ rows two by two to form $p$ pairs). It suffices to notice that

$$D[\Gamma_P(I,3); l_1, l_2, l_3] = \sum_{\{I_1,\ldots,I_p\}\in\mathcal{L}}\ \prod_{i=1}^{p} D[\Gamma_P(I_i,3); l_1, l_2, l_3].$$

For any $i$, $D[\Gamma_P(I_i,3); l_1, l_2, l_3]$ is easily seen to include $\Delta_{l_1l_2l_3}$ nonzero
summands, each of them of the same form, up to a relabeling of the indexes; more
precisely,

$$D[\Gamma_P(I_i,3); l_1, l_2, l_3] = \Delta_{l_1l_2l_3}\sum_{m_{i1},m_{i2},m_{i3}}\begin{pmatrix} l_1 & l_2 & l_3 \\ m_{i1} & m_{i2} & m_{i3} \end{pmatrix}^2 = \Delta_{l_1l_2l_3}.$$

Thus

$$D[\Gamma_P(I,3); l_1, l_2, l_3] = \sum_{\{I_1,\ldots,I_p\}\in\mathcal{L}}\ \prod_{i=1}^{p}\Delta_{l_1l_2l_3} = (2p-1)!!\,\Delta^{p}_{l_1l_2l_3},$$

as claimed. □


PROOF OF PROPOSITION 3.2. (a) The case #(I) = 4. This result can be established
as a straightforward application of Lemmas 3.1–3.3.

(b) The case #(I) = 6. Without loss of generality we can take I = {1, 2, ..., 6},
and show that 
Eb r(,! 2 b) 
Mil m,» mij3 


mjjz—lj — mgaz-—l3i1—1 


8(y; li 13) = 01). 
y €T (6, AP p (6,3) 

It is sufficient to consider only the diagrams with no flat edges; it is readily seen
that the latter can be partitioned into (a) the unconnected (unpaired) diagrams and 
(b) the connected diagrams. Now for (a) we note that, because there cannot be 
any flat edge, if the diagram is unconnected but not paired the set of rows must 
necessarily be partitioned into a group of two and a group of four; in other words, 
after some rearrangement of indices it must be possible to write any diagram $\gamma$
belonging to (a) as $\gamma = \gamma_1 \cup \gamma_2$, where $\gamma_1 \in \Gamma_P(2,3)$ and $\gamma_2 \in \Gamma_C(4,3)$. Thus,
exactly as shown above, we obtain that the corresponding terms are bounded by
$C(2l_1+1)^{-1}$. It suffices then to look at the connected diagrams. Consider first
$\gamma \in \Gamma_{CL(2)}(6,3)$; from Lemma 3.2 and simple manipulations we obtain immediately

(56) $$D[\Gamma_{CL(2)}(6,3)] = O\big(l_1^{-1} D[\Gamma_C(4,3)]\big) = O(l_1^{-2}).$$


In case y € l'eyo5(6, 3), it can be readily shown that we must have y € 
{Tero (6, 3) U Tera (6, 3)); in other words, these diagrams have at least a loop 
of order either 3 or 4 [U cra (6, 3) is empty]. The latter claim can be established 
as follows. We have a graph with six vertices, each of which is of degree 3. Hence 
we can argue as in ([11], page 48) to show that the closure of the graph is com- 
plete (it is a clique). Then, from Theorem 4.5 of ([11], page 48) it follows that 
the graph is Hamiltonian, that is, it contains a spanning cycle. With a permuta- 
tion of the indices, we can then take the vertices to be at the corners of a regular 
hexagon, and we are left with an edge free for each of them. These edges will con- 
nect different vertices to form loops of order 3 or 4. The graphs corresponding to 
y € Fero) (6, 3) with at least a loop of order 3 are labeled type A; those with all 
loops of order at least 4 [y € l'cr(4)(6, 3)] are labeled type B (see Figure 4). 

Let y correspond to a type A graph; by Lemma 3.3 we easily have (see Figure 3) 


h h b 
Diyih h ii=0(|(} 2 p) PEG: h bh]) 


= h b BIN 442 
*o(in n g} )7 95^ 
For type B, again up to a permutation of the indices we can take, with no loss of 
generality, 
mj; = M21, mi2 = ma, mi3 = —m63, 


M22 = —mM32, m23 = —ms3, m3, = —mé|, 


m33 = —ma43, maj = —m35], ms» = —m62. 


FIG. 4. $\gamma \in \Gamma_{CL(3)}(6,3)$, $\gamma \in \Gamma_{CL(4)}(6,3)$.


The corresponding expected value is premultiplied by the factor 


(— 1) 1bmiztmismsi ms? rmssemsi- Ems? mss _. 7 
? 


whence we have 


ee, a 


z],3,5]-17 
x ( ly l l3 Y li ly l3 ) 
—mijj —m32 —ms53 m31 mia» M33 
(57) «( ly l2 l3 )( h k 1m ) 
—ms; —-Mj2 —m33 ms; ms» M53 


«( li ly l3 ) 
—m3a; —ms2 —m43 


-Xewesv[t5 5) s)[o b b) 


bh xIllb x bh 


where we have used ([34], equation 10.2.3.17, page 339, and equation 10.2.4.20, 
page 340). In view of (55) we easily obtain 





(57) < xi = 0d; "Lr ^r^ gp) = 007’). 


3m l5 


(c) The case #(I) = 8. Again, we can take I = {1, 2, ..., 8} and show that
h 


Soa S n(; bb) 
m, m,» mis 


mip=—ly mMg3=—l5 11 


x 2, bil, bb) = OG’). 
y e(P (8,3 AT p(8,3)] 


As before, it is readily seen that the diagrams with no flat edges can be partitioned 
into (a) the unconnected unpaired diagrams and (b) the connected diagrams. Now 
for (a) we note that 


Y €[(Ue ($5, 3 AT pe, ) N PR(8, 3)] 
=> y € [Te 3) ® Teh, 3) U cs, 3) @ Tc, 3)) 
U (Fc(5,3) 8 Tels, 3) 8 reih, 3), 


where (J, /2), (I5, I4), Us, Io, I7) are partitions of {1,2,...,8} into disjoint sets 
such that 


$(I1) — 6, 1(12) = 2, 
f(I3) = 4, f (I4) = 4, 
ft(I5) = 4, f (I) = 2, #(17) = 2; 


by ri @ T5, we denote the set of all diagrams of the form y = yı U y2, for 
yi ET; and y € I». For any two families of diagrams I"; (71,3), D2(/5, 3) such 
that 1; N Jp = Ø, itis readily checked that 


DID 8 T3; h1, 4,6] = O(D[T1; 4,4.) x DIT2; ho, l3], 
DIT; U T3; lh, 5,1] = OCDIT T h, 5, I5] + DIT2; hh, hb, bs): 
thus 
DTS, 3AT p (8, 3); 1, l2, l3] 
= O(DII'c(6, 3) 8 'c(2, 3); hi, l2, 13]) 
+ O(D[Fc(4, 3) 8 Tc(4, 3); l, l2, i5]) 
+ O(D[Tc(4, 3) 8Tc 2,3) 8T cQ,3); l, D, I3]) 
= O(DI[Tc(6,3); l, b, 13]) 
-+ O(D^[T c (4, 3); Uy, lo, 13]) + O(DIT C (4, 3); hh, l2, 13]) 
= O(1;'), 


as shown previously. It suffices then to look at the connected diagrams. Diagrams 
with a 2-loop can be handled by Lemma 3.2, and then by using results on lower- 
order diagrams. So we just have to consider y € l'erc;($8, 3); these diagrams must 


FIG. 5. Isomorphic type C graphs.


have at least a loop of order 3 or 4 [in other words, I cza (8, 3) is empty]. The latter 
claim can be established in the same manner as before; we repeat the argument for 
completeness. We have a graph with eight vertices, each of which is of degree 3. 
Hence we can argue as in ([11], page 48) to show that the closure of the graph is 
complete. Then, from Theorem 4.5 of [11] it follows that the graph is Hamiltonian, 
that is, it contains a spanning cycle. With a permutation of the indices, we can then 
take the vertices to be at the corners of a regular octagon, and we are left with 
an edge free for each of them. These edges will connect different vertices to form 
loops of order 3, 4 or 5. Graphs with loops of order 3 can be dealt with as before 
through Lemma 3.3. If all loops are of even order, then the graph is bipartite ([11], 
Theorem 2.4, page 23); we label it a type C graph, for which we provide two 
isomorphic representations in Figure 5. 


If there is at least a loop of order 5, we label it a type D graph, for which two 
isomorphic representations are provided in Figure 6. 

To analyze the behavior of the components D[y, l1, /2, 13] corresponding to 
type C graphs, we can take with no loss of generality 


my) = —mjj, m? = —m42,; mj3-— —mg3, M32 = m», 
m3] = —nma4,, m33 = —mga, m51 = —mgj, ms52 = —msgj, 


m53 = —ma435, m7) = —mgi, m72 = mg», MNR = —mM23, 




FIG. 6. Isomorphic type D graphs. 
which leads to 
TT IY (i 5. uh Y l b ls ) 
«( li I> la Y li lo ls ) 
m3; m32 m33/ \—m3, —mio —ms3 
x ( li ly l4 ) 
msy m52 ms 
(58) «( h h ls Y h h k ) 
—ms; -mn —ma,/jXm;j mn mnz 


«( li In l3 ) 
—mj7j --ms» -—mi3 


c sies: hb h x|[fb h x 
eo Ler+yi; boatle nk 
3 h h x|[l3 h x 
L h hlib b hb’ 
where we have used ([34], equations 10.13.3.23 and 10.13.3.25, page 367); the 


sum runs over all positive integers x which satisfy the triangle inequalities l2 —1, < 
x <hl +l. By using (55), we obtain 


5 lo Hood lz iy x l4 h x ls l x 
SIUS COM pM) max [7 l3 nite ly ate 5 n D 4 


= oq; rM, 


In view of ([34], equations 10.13.1.1 and 10.13.1.3, page 361) the proof can be
completed by an analogous argument for components corresponding to type D
graphs. □


Acknowledgments. I am grateful to an Associate Editor and three anonymous
referees for valuable comments that greatly improved the presentation of the paper; 
I am also grateful to P. Cabella for carrying out the simulations in Section 5. 


REFERENCES 


[1] ADLER, R. J. (1981). The Geometry of Random Fields. Wiley, New York. MR0611857

[2] AGHANIM, N., KUNZ, M., CASTRO, P. G. and FORNI, O. (2003). Non-Gaussianity: Comparing
wavelet and Fourier based methods. Astronomy and Astrophysics 406 797–816.
Available at arxiv.org as astro-ph/0301220.

[3] BARTOLO, N., MATARRESE, S. and RIOTTO, A. (2002). Non-Gaussianity from inflation.
Phys. Rev. D 65 103505. Available at arxiv.org as hep-ph/0112261.

[4] BENNETT, C. L. et al. (2003). First year Wilkinson Microwave Anisotropy Probe (WMAP)
observations: Preliminary maps and basic results. Astrophysical J. Suppl. Ser. 148 1–27.

[5] BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York.
MR0233396

[6] CABELLA, P., HANSEN, F. K., MARINUCCI, D., PAGANO, D. and VITTORIO, N. (2004).
Search for non-Gaussianity in pixel, harmonic, and wavelet space: Compared and combined.
Phys. Rev. D 69 063007.

[7] DAVIDSON, J. (1994). Stochastic Limit Theory. Oxford Univ. Press. MR1430804

[8] DE BERNARDIS, P. et al. (2000). A flat Universe from high resolution maps of the cosmic
microwave background radiation. Nature 404 955–959.

[9] DORÉ, O., COLOMBI, S. and BOUCHET, F. R. (2003). Probing cosmic microwave background
non-Gaussianity using local curvature. Monthly Notices Roy. Astronom. Soc. 344 905–916.
Available at arxiv.org as astro-ph/0202135.

[10] ERIKSEN, H. K., HANSEN, F. K., BANDAY, A. J., GORSKI, K. M. and LILJE, P. B. (2004).
Asymmetries in the cosmic microwave background anisotropy field. Astrophysical J. 605
14–20.

[11] FOULDS, L. R. (1992). Graph Theory Applications. Springer, Berlin. MR1312607

[12] GENOVESE, C., MILLER, C., NICHOL, R., ARJUNWADKAR, M. and WASSERMAN, L.
(2004). Nonparametric inference for the cosmic microwave background. Statist. Sci. 19
308–321. MR2146946

[13] GOTT, J. R., PARK, C., JUSZKIEWICZ, R., BIES, W. E., BENNETT, D. P., BOUCHET, F.
R. and STEBBINS, A. (1990). Topology of microwave background fluctuations: theory.
Astrophysical J. 352 1–14.

[14] HANANY, S. et al. (2000). MAXIMA-1: A measurement of the cosmic microwave background
anisotropy on angular scales of 10′–5°. Astrophysical J. Letters 545 L5–L9.

[15] HANSEN, F. K., CABELLA, P., MARINUCCI, D. and VITTORIO, N. (2004). Asymmetries in
the local curvature of the Wilkinson Microwave Anisotropy Probe data. Astrophysical J.
Letters 607 L67–L70. Available at arxiv.org as astro-ph/0402396.

[16] HANSEN, F. K., MARINUCCI, D. and VITTORIO, N. (2003). The extended empirical process
test for non-Gaussianity in the CMB, with an application to non-Gaussian inflationary
models. Phys. Rev. D 67 123004. Available at arxiv.org as astro-ph/0302202.

[17] HU, W. (2001). The angular trispectrum of the CMB. Phys. Rev. D 64 083005. Available at
arxiv.org as astro-ph/0105117.


[18] JOHNSON, N. L. and KOTZ, S. (1972). Distributions in Statistics: Continuous Multivariate
Distributions. Wiley, New York. MR0418337

[19] KOMATSU, E. and SPERGEL, D. N. (2001). Acoustic signatures in the primary microwave
background bispectrum. Phys. Rev. D 63 063002. Available at arxiv.org as
astro-ph/0005036.

[20] KOMATSU, E., WANDELT, B. D., SPERGEL, D. N., BANDAY, A. J. and GORSKI, K. M.
(2002). Measurement of the cosmic microwave background bispectrum on the COBE
DMR Sky Maps. Astrophysical J. 566 19–29. Available at arxiv.org as astro-ph/0107605.

[21] KOMATSU, E. et al. (2003). First year Wilkinson Microwave Anisotropy Probe (WMAP)
observations: Tests of Gaussianity. Astrophysical J. Suppl. Ser. 148 119–134. Available at
arxiv.org as astro-ph/0302223.

[22] LEONENKO, N. (1999). Limit Theorems for Random Fields with Singular Spectrum. Kluwer,
Dordrecht. MR1687092

[23] MARINUCCI, D. (2004). Testing for non-Gaussianity on cosmic microwave background
radiation: A review. Statist. Sci. 19 294–307. MR2140543

[24] MARINUCCI, D. and PICCIONI, M. (2004). The empirical process on Gaussian spherical
harmonics. Ann. Statist. 32 1261–1288. MR2065205

[25] MCLEISH, D. L. (1977). On the invariance principle for nonstationary mixingales. Ann.
Probab. 5 616–621. MR0445583

[26] NOVIKOV, D., SCHMALZING, J. and MUKHANOV, V. F. (2000). On non-Gaussianity in the
cosmic microwave background. Astronomy and Astrophysics 364 17–25. Available at
arxiv.org as astro-ph/0006097.

[27] PARK, C.-G. (2004). Non-Gaussian signatures in the temperature fluctuation observed by the
Wilkinson Microwave Anisotropy Probe. Monthly Notices Roy. Astronom. Soc. 349 313–320.

[28] PEACOCK, J. A. (1999). Cosmological Physics. Cambridge Univ. Press.

[29] PEEBLES, P. J. E. (1993). Principles of Physical Cosmology. Princeton Univ. Press.
MR1216520

[30] PHILLIPS, N. G. and KOGUT, A. (2000). Statistical power, the bispectrum and the search for
non-Gaussianity in the cosmic microwave background anisotropy. Astrophysical J. 548
540–549. Available at arxiv.org as astro-ph/0010333.

[31] SMOOT, G. F. et al. (1992). Structure in the COBE differential microwave radiometer first-year
maps. Astrophysical J. Letters 396 L1–L5.

[32] STEIN, M. L. (1999). Interpolation of Spatial Data. Some Theory for Kriging. Springer, Berlin.
MR1697409

[33] TENORIO, L., STARK, P. B. and LINEWEAVER, C. H. (1999). Bigger uncertainties and the
Big Bang. Inverse Problems 15 329–341.

[34] VARSHALOVICH, D. A., MOSKALEV, A. N. and KHERSONSKII, V. K. (1988). Quantum
Theory of Angular Momentum. World Scientific, Singapore. MR1022665

[35] VILENKIN, N. J. and KLIMYK, A. U. (1991). Representation of Lie Groups and Special
Functions 1. Kluwer, Dordrecht. MR1143783

[36] WORSLEY, K. J. (1995). Boundary corrections for the expected Euler characteristic of
excursion sets of random fields, with an application to astrophysics. Adv. in Appl. Probab. 27
943–959. MR1358902

[37] WORSLEY, K. J. (1995). Estimating the number of peaks in a random field using the Hadwiger
characteristic of excursion sets, with an application to medical images. Ann. Statist. 23
640–669. MR1332586

[38] YAGLOM, A. M. (1987). Correlation Theory of Stationary and Related Random Functions 1.
Basic Results. Springer, New York. MR0893393


DIPARTIMENTO DI MATEMATICA 
UNIVERSITÀ DI ROMA "TOR VERGATA"
VIA DELLA RICERCA SCIENTIFICA 1
00133 ROMA
ITALY
E-MAIL: marinucc@mat.uniroma2.it


The Annals of Statistics 

2006, Vol. 34, No. 1, 42–77

DOI: 10.1214/009053605000000868

© Institute of Mathematical Statistics, 2006 


EXTENDED STATISTICAL MODELING UNDER SYMMETRY; 
THE LINK TOWARD QUANTUM MECHANICS 


BY INGE S. HELLAND 
University of Oslo 


We derive essential elements of quantum mechanics from a parametric
structure extending that of traditional mathematical statistics. The basic setting
is a set A of incompatible experiments, and a transformation group G
on the cartesian product Π of the parameter spaces of these experiments. The
set of possible parameters is constrained to lie in a subspace of Π, an orbit or
a set of orbits of G. Each possible model is then connected to a parametric
Hilbert space. The spaces of different experiments are linked unitarily, thus
defining a common Hilbert space H. A state is equivalent to a question together
with an answer: the choice of an experiment a ∈ A plus a value for the
corresponding parameter. Finally, probabilities are introduced through Born's
formula, which is derived from a recent version of Gleason's theorem. This
then leads to the usual formalism of elementary quantum mechanics in important
special cases. The theory is illustrated by the example of a quantum
particle with spin.


1. Introduction. Both statistics and quantum theory deal with prediction
using the concept of probability. Historically, the difference between the two
disciplines has been large, but in the last few years it has diminished, not least
owing to the recent work by Barndorff-Nielsen, Gill and Jupp [7].

The lack of contact between the two disciplines is of course related to the
difference in foundation, but one of the aims of the present paper is to argue that
to a certain extent, this difference in foundation can be overcome. This may perhaps
at first be difficult to believe: In statistics, the state of a given system is given
simply by a probability measure on some measurable space. In quantum theory in its
most common formulation the state of a system is given by a vector $v$ in some
abstract Hilbert space. As a continuation of this formal theory, each observable is
linked to a self-adjoint operator $T$ on the same Hilbert space in such a way that the
expectation of this observable in the state $v$ is given by $(v, Tv)$. Associated with
this is Born's formula: The transition probability from state $u$ to state $v$ is of the
form $|(v, u)|^2$. Also, in the absence of what physicists call superselection rules,
linear combinations of state vectors form new state vectors, which leads to interference
phenomena unknown to classical statistics.


Received March 2003; revised March 2005.

AMS 2000 subject classifications. Primary 62A01; secondary 81P10, 62B15.

Key words and phrases. Born's formula, complementarity, complete sufficient statistics,
Gleason's theorem, group representation, Hilbert space, model reduction, quantum mechanics,
quantum theory, symmetry, transition probability.




STATISTICS AND QUANTUM MECHANICS 43 


The Born formula allows physicists to compute probabilities for sets of out- 
comes, perhaps as a function of certain parameters. Statistical methods can then 
be used for inference about these parameters, as discussed in [7]. By contrast, the 
present paper aims at giving a statistical interpretation of the vectors v themselves. 
If parameters are introduced as in op. cit., the total model will be similar to the 
hierarchical models used in Bayesian statistics. We will not use these latter kinds 
of parameters in the present paper. Our parametric models will be of the simplest 
kind, but we will emphasize that the choice between different experimental ques- 
tions to focus upon also may imply a choice between different parametric models. 

The quantum formalism as such is the result of a long development within
physics, starting with discoveries by Max Planck, and where contributions have
been made by Bohr, Pauli, Schrödinger, Heisenberg and many others. There are
many good books on quantum theory, for instance, [39], where also some of the
philosophical background is discussed.

Many authors have tried to find deeper foundations leading to the formalism of 
quantum theory. Several mathematical approaches are discussed in [60]. One such 
approach is quantum logic, treated in detail by Beltrametti and Cassinelli [12]. 

The earliest book on the mathematical foundation of quantum mechanics 
is [58]; in English translation, [59]. This book has had great influence; in its time it
constituted a very important mathematical synthesis of the theory of quantum phe- 
nomena. The book can also be considered to be a forerunner of quantum probabil- 
ity. For physicists, von Neumann's book was supplemented by the book of Dirac 
[24], which started the development leading to modern quantum field theory. 

The development of quantum probability as a mathematical discipline, contin- 
uing the more formal development of quantum theory, was started in the 1970's. 
A first important topic was to develop a noncommutative analogue of the notion of 
stochastic processes; see [1] and references therein. Other topics were noncommu- 
tative conditional expectations and quantum filtering and prediction theory ([10] 
and references therein). 

Quantum probability was made popular among ordinary probabilists by 
Meyer [45]. A related book is [49], which discusses the quantum stochastic cal- 
culus founded by Hudson and Parthasarathy, but also many other themes related 
to the mathematics of current quantum theory. An example of a symposium proceedings
volume aiming to cover both conventional probability theory and quantum
probability is [2].

There are also links between quantum theory and statistical inference theory. 
A systematic treatment of quantum hypothesis testing and quantum estimation the- 
ory was first given by Helstrom [37]. In [38] several aspects of quantum inference 
are discussed in depth; among other things the book contains a chapter on symme- 
try groups. A survey paper on quantum inference is Malley and Hornstein [43]. 

As an example of a particular statistical topic of interest, consider that of Fisher
information. Since a quantum state ordinarily allows several experiments, this
concept can be generalized in a natural way. A quantum information measure due to
Helstrom can be shown to give the maximal Fisher information over all possible
experiments; for a recent discussion see [6].

One can thus point to several links between ordinary probability and statistics on 
the one hand and their quantum counterparts on the other hand. However, a general 
theory encompassing both sides, based on a reasonably intuitive foundation, has 
until now been lacking. 

The main purpose of the present paper is indeed to suggest a new approach to the 
statistical foundation of quantum mechanics based on elementary concepts such as 
choice of experiment, probability model, complementarity, symmetry and model 
reduction. I claim that this approach leads to a conceptual basis which is more 
intuitive than the usual one. This is of course a very bold statement, knowing how 
well established the ordinary quantum formalism is, especially since the program 
started here also needs further development. Nevertheless, I will claim that for 
readers knowing statistical theory and some group theory, the present approach 
will probably be more enlightening than the usual formalism. 

In addition to the implications for quantum theory, the concepts needed to com- 
plete this program, and also concepts learned directly from quantum theory, may 
at the same time turn out to lead to an enrichment of current statistical theory. 

An example is the concept of complementarity; in our approach this denotes 
the situation where two parameters cannot both be estimated accurately in a given 
context, but it can also be given a wider content. In our opinion this concept should 
not be confined to the microworld. This view is also in line with Bohr [16], who 
gave talks explaining the concept of complementarity to, among others, biologists 
and sociologists. 

A related generalization of the ordinary statistical paradigm will in fact be basic
to our main setting: Before we look at the parameter of a concrete experiment, we
consider all questions that can be addressed in any experiment in a given context.
Thus there is a total parameter $\phi$, which is a vector containing all theoretical
quantities that can be imagined for a given system. Any experiment which is chosen
has a parameter that is a function of $\phi$, but $\phi$ itself has too rich a content
to be estimated. Some ordinary statistical situations that can be fit into this
pattern are:


EXAMPLE 1. Consider all quantities of relevance that are contemplated at the 
experimental design phase. This can be made concrete in many different directions. 


EXAMPLE 2. A questionnaire is designed for a statistical investigation with
a fixed number of alternatives for each question. Some respondents insist on
giving unexpected but informative answers, say, comments in addition to the fixed
questions. The total parameter $\phi$ may contain some such possibilities.


EXAMPLE 3. More generally: A statistical investigation on some group of
humans is performed, say, through a questionnaire. Let $\phi$ contain all possible
information about these humans which may have some relevance to the concrete
questions posed.




EXAMPLE 4. There is a fragile apparatus for some specific length measurement
which is destroyed after one measurement. Let $\mu$ be the length which is to
be measured. Assume furthermore that the standard deviation of measurement $\sigma$
can only be estimated by destroying the apparatus. Let then $\phi = (\mu, \sigma)$.


EXAMPLE 5. Assume that a particular patient has an expected survival
time $\lambda^1$ if he gets treatment 1 at a specific time $t$, and expected survival
time $\lambda^2$ if he gets treatment 2 at that time. Here "expected" is not primarily
meant in relation to a probability model, but may at this point be related to what is
expected by the medical experts taking into account all knowledge they have about the
patient and about the treatments. Then $\phi = (\lambda^1, \lambda^2)$ can never be estimated.


EXAMPLE 6. Let there be two questions which are to be asked of an individual,
where we know that the answer will depend on the order in which the
questions are posed. Let $(\lambda_1, \lambda_2)$ be the expected answer when the
questions are posed in one order, and $(\lambda_3, \lambda_4)$ when the questions are
posed in the other order. Then $\phi = (\lambda_1, \ldots, \lambda_4)$ cannot be
estimated from one individual.


Many more realistic, moderately complicated, examples exist, like the behav- 
ioral parameters of a rat taken together with parameters of the brain structure which 
can only be measured if the rat is killed. 

We will concentrate much on the statistical parameter space. An essential point
of the statistical paradigm is that, before the experiment, the parameter $\lambda$ is
unknown; afterward it is as a rule fairly accurately determined. In this way the focus
is shifted from what the value of the parameter "is" to the knowledge we have
about the parameter. In a physical context this can easily be made consistent with
the point of view expressed by Niels Bohr, cited from [51]: "It is wrong to think
that the task of physics is to find out how nature is. Physics concerns what we
can say about nature." This statement is also in agreement with current views of
quantum theory, as expressed, for instance, by Fuchs [27].

It is well known that there exists in the literature a large number of sugges- 
tions for interpretations of quantum theory; a very incomplete list is given by 
the references [13, 15, 20, 25]. Most of these interpretations include the ordinary 
minimalistic interpretation of Niels Bohr (the Copenhagen school or pragmatic in- 
terpretation concentrating on interpreting the outcomes of concrete experiments; 
for more details see [39]). The present article also implies a particular statistical 
interpretation related to the Niels Bohr interpretation, but it is beyond the scope of 
this paper to discuss in detail relations to other interpretations given in the literature.

There are also a few related papers in the recent literature. Bohr and Ulfbeck
[14] discuss a foundation of quantum mechanics which is based upon irreducible
representations of groups, and thus uses symmetry in a way which is similar to ours.
Caves, Fuchs and Schack [19] propose a Bayesian approach to quantum theory
based upon Gleason's powerful Hilbert space theorem. Here we will avoid taking
an abstract Hilbert space as a point of departure, but we will arrive at it from a
rather concrete setting. Finally, Hardy [32] derives quantum theory and probability
theory from a few reasonable axioms, without going into any details concerning
the state concept.

Sections 2-7 below are preparatory: In Section 2 group actions on the sample
space and on the parameter space of an experiment are discussed, and the concept
of permissibility is introduced. In Section 3 it is shown that permissibility
always can be achieved by going to a subgroup; such a subgroup connected to an
experimental parameter will be important later. In Section 4 the relation to causal
inference, in particular to the concept of counterfactuals, is discussed, while in
Section 5 the main quantum-mechanical example, electron spin, is treated. Section 6
gives the starting point sketched in the abstract above: reduction of the cartesian
product of the parameter spaces of complementary experiments, while Section 7
treats model reduction in general and introduces the concept of group representation.

Then in Sections 8-10 the basic Hilbert space is introduced, first for a single
experiment and then tied together for several complementary experiments. The
treatment in these sections could have been simplified considerably by concentrating
on the parameter space. The full discussion involving the sample space is included 
mainly for three reasons, however: First, this paves the way for further general- 
izations. Second, the context of an experiment is related to the limitation of the 
data that can be obtained, and this context is felt to play a role in the quantization. 
Third, a discussion of the full experiment is needed later in Section 12. 

Before that, in Section 11, operators and states are introduced. 

An important result is proved in Section 12: Born's formula for the transition
probability between experiments. From this, the basic formalism of elementary 
quantum mechanics is derived in Section 13. 

In what follows, we will make several explicit assumptions; most of them are 
relatively weak and fairly natural in a statistical setting. The exceptions to this are 
Assumption 5, which is a simple assumption about the connection between the 
parameter spaces associated with different choices of experiments; Assumption 7, 
which through a limitation of the parameter space serves to restrict us to a dis- 
cussion of elementary quantum theory; and finally, Assumption 8, which gives the 
symmetry assumption needed to derive Born's formula and from this the formal- 
ism of elementary quantum mechanics. 


2. Statistical models and groups. In general the total parameter space
$\Phi$, the range of the total parameter $\phi$, can have almost any structure; in this
paper we will assume:


ASSUMPTION 1. $\Phi$ is a locally compact topological space. There is a
transformation group $G$ acting on $\Phi$ which satisfies certain weak technical
requirements (see Appendix A.1) so that $\Phi$ can be given a right invariant measure
$\nu$, that is, a measure which satisfies $\nu((d\phi)g) = \nu(d\phi)$.




Note that in this paper, group actions will always be written to the right:
$\phi \mapsto \phi g$. The reason for this is simply that it facilitates the introduction
of the right invariant measure, which from several points of view [34] in the case of a
single parameter can be argued to be the best choice of a noninformative prior
under symmetry in ordinary Bayesian statistical inference.

The right invariant measure is unique (up to a fixed constant) for transitive
transformation groups, that is, group actions where the space consists of one single
orbit. An orbit is defined as a set of the form $\{\phi : \phi = \phi_0 g;\ g \in G\}$.
In general the space $\Phi$ can be divided into several orbits, and the invariant measure
is unique on each orbit; it must be supplemented by some measure on the orbit indices
in order to give a measure on the whole space $\Phi$.

When a group $G$ is defined on the (total) parameter space $\Phi$, an important
property that an experimental parameter may or may not have is the following (cf.
McCullagh [44], who chose to call this concept natural):


DEFINITION 1. The parameter $\lambda$ is called permissible as a function
$\lambda(\phi)$ if it satisfies:


If $\lambda(\phi_1) = \lambda(\phi_2)$, then $\lambda(\phi_1 g) = \lambda(\phi_2 g)$ for all $g \in G$.


The most important argument for this restriction is that it leads to a uniquely
defined action of the group $G$ on the image space $\Lambda$ of $\lambda(\phi)$:


(1)    $(\lambda g)(\phi) = \lambda(\phi g)$.


Several general arguments for permissibility are given in [33, 34]: When this
property holds, the best equivariant estimator, which essentially is the Bayes
estimator under prior $\nu$, is conserved under model reduction using functions of
$\lambda$. Also, in the transitive case credibility intervals under the invariant prior
turn out to be identical to confidence intervals, and certain paradoxes related to
Bayes estimation are avoided.

Trivially, the total parameter $\lambda = \phi$ itself is permissible. Also, the vector
parameter $(\lambda^1, \ldots, \lambda^k)$ is permissible if each $\lambda^i$ is permissible.

As will be shown in the next section, if $\lambda$ is not permissible with respect to $G$,
one can always define a maximal subgroup with respect to which $\lambda$ is permissible.
This will be the usual case in our setting.

Let now a general group $D$ of transformations be defined on the parameter space
$\Lambda$, the range of $\lambda$. This transformation group $D$ will be kept fixed, being
thought of as a part of the specification of the problem in addition to the statistical
model.

Sometimes a group $D$ of transformations on the sample space is defined first,
and then the actions on the parameter space are introduced via the statistical model
by defining probability measures $P^{\lambda g}$ for $g \in D$ on the sample space $X$ by


(2)    $P^{\lambda g}(B) = P^{\lambda}(B g^{-1})$ for sets $B$.




Then the connection between these two transformation groups is a homomorphism:
If $g_1$ and $g_2$ are taken to act on the two spaces $X$ and $\Lambda$, then $g_1^{-1}$
and $g_1 g_2$ act on both spaces in the same way. The concept of homomorphism will be
fundamental to this paper. It means that we have very similar group actions: The identity
element, inverses and subgroups are mapped as they should be between the two
transformation groups; that is, the essential structure is inherited. This is the
reason why the same symbol $D$ can and will be used for both transformation groups.
If $g$ is mapped by (2) into the identity $e$ only when $g = e$, then the homomorphism
will be an isomorphism: The structures of the two groups are then essentially
identical. If in addition a one-to-one correspondence can be established between the
spaces upon which the groups act, everything will be equivalent.
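
Relation (2) can be checked numerically in the simplest case. The sketch below assumes a normal location model with unit variance, with the translation group acting by x -> x + c on the sample space and by lambda -> lambda + c on the parameter space. The model, the function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def normal_cdf(x, mean, sd=1.0):
    """CDF of N(mean, sd^2), computed from the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def prob_interval(lo, hi, mean, sd=1.0):
    """P(lo < X <= hi) for X ~ N(mean, sd^2)."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

def check_equation_2(lam, c, lo, hi):
    """Return both sides of (2) for the shift g: x -> x + c and B = (lo, hi].

    On the parameter space the same g acts as lam -> lam + c,
    and B g^{-1} is the interval B shifted back by c.
    """
    lhs = prob_interval(lo, hi, lam + c)      # P^{lam g}(B)
    rhs = prob_interval(lo - c, hi - c, lam)  # P^{lam}(B g^{-1})
    return lhs, rhs

lhs, rhs = check_equation_2(lam=0.3, c=1.7, lo=-0.5, hi=2.0)
print(lhs, rhs)  # the two values agree up to rounding error
```

The same pattern would apply to a scale group, with the interval endpoints divided by the scale factor instead of shifted.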

A further discussion of symmetry groups in statistics is given in [34] and in
Appendix A.1. Note that the existence of a group $D$ acting on the parameter space
$\Lambda$ in fact requires very few explicit invariance properties. What is needed is
basically: (i) The sample space and the parameter space should both be closed under
the transformations in the group. (ii) If the problem is formulated in terms of a loss
function, this should be unchanged when observations and parameters are transformed
conformably by the group. (iii) If a noninformative prior on $\Lambda$ is needed,
the right invariant distribution $\nu$ on this space should be used.


3. Experimental parameters and permissibility. Assuming that a parameter
or total parameter $\phi$ is used to model some given part of reality, there are usually
many questions that can be investigated in such a setting. Very often different such
questions are addressed by performing different experiments on the specific part of
reality in question. (A related case is when different questions are addressed within
the same experiment, e.g., when statisticians consider different sets of orthogonal
contrasts in an analysis of variance experiment.)

Let $\mathcal{A}$ be the set of such questions, from now on in this paper assumed to be
connected to different experiments.


ASSUMPTION 2. For each $a \in \mathcal{A}$ there is a parameter
$\lambda^a = \lambda^a(\phi)$, for which we assume that a probability model
$P^{\lambda^a}(\cdot)$ exists corresponding to experiment $a$. It is assumed that each
experiment is maximal, that is, that there exists no possible experiment with
parameter $\mu^a$ such that $\lambda^a$ is a proper function of $\mu^a$.


In a physical context, $P^{\lambda^a}(\cdot)$ should be the probability measure for the
measurement apparatus, at the present moment left unspecified.

When we in the sequel talk about choice of experiment/question $a$, we really
mean a choice of $(a, \lambda^a)$. But the probability measure $P^{\lambda^a}(\cdot)$ is
thought to be connected to the measurement apparatus, and is not at the outset included
in this choice. Quantum probabilities are first introduced in Theorem 5.

When a transformation group $G$ is defined on the (total) parameter space $\Phi$, an
important property of the experimental parameter $\lambda^a$ is whether it is a permissible
function $\lambda^a(\phi)$. As already said, the most important argument for this
restriction is that it leads to a uniquely defined transformation group $G^a$ on the
image space $\Lambda^a$ of $\lambda^a(\phi)$, so that
$(\lambda^a g^a)(\phi) = \lambda^a(\phi g^a)$ for $g^a \in G^a$.

As a simple illustration of a group connected to a parameter space or the total
parameter space, look at the (total) parameter $\phi = (\mu, \sigma)$ with the
translation/scale group $(\mu, \sigma) \mapsto (a + b\mu, b\sigma)$ where $b > 0$. The
following one-dimensional parameters are permissible: $\mu$, $\sigma$, $\mu^3$,
$\mu + \sigma$, $\mu + 3\sigma$, and if such a parameter is called for, say as a focus
parameter, all of these are valid candidates.

On the other hand, the following parameters are not permissible, and would
according to McCullagh [44] lead to absurd focus parameters under this group:
$\mu + \sigma^2$, $\sigma e^{\mu}$, $\tan(\mu)/\sin(\sigma)$.

A further example is given by the coefficient of variation $\sigma/\mu$. This is not
permissible. (The location part of the transformation does not make sense here.) But
it will be permissible if the group is reduced to the pure scale group
$(\mu, \sigma) \mapsto (b\mu, b\sigma)$, $b > 0$. This points at an important general


PRINCIPLE. If a focus parameter $\lambda^a(\phi)$ is not permissible with respect to the
basic group $G$, then take a subgroup $G^a$ so that it becomes permissible with respect
to this subgroup.
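
The permissibility claims for the translation/scale group can be probed mechanically. The sketch below searches a small grid for a counterexample to Definition 1, using exact rational arithmetic. The grid, the finite sample of group elements and all names are hypothetical choices made for illustration. A search of this kind can only refute permissibility; finding no violation is consistent with permissibility but does not prove it.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite test grid of points phi = (mu, sigma) with mu, sigma > 0.
MUS = [Fraction(m) for m in (1, 2, 3, 4)]
SIGMAS = [Fraction(s) for s in (1, 2, 3, 4)]
POINTS = list(product(MUS, SIGMAS))

# Sampled group elements g = (a, b), acting as (mu, sigma) -> (a + b*mu, b*sigma).
FULL_GROUP = [(Fraction(a), Fraction(b))
              for a in (0, 1, 2) for b in (Fraction(1, 2), 1, 3)]
SCALE_GROUP = [(a, b) for (a, b) in FULL_GROUP if a == 0]

def act(phi, g):
    """Apply the translation/scale transformation g to phi."""
    mu, sigma = phi
    a, b = g
    return (a + b * mu, b * sigma)

def find_violation(lam, group):
    """Return (phi1, phi2, g) contradicting Definition 1 on the grid, or None."""
    for phi1, phi2 in product(POINTS, repeat=2):
        if lam(*phi1) != lam(*phi2):
            continue
        for g in group:
            if lam(*act(phi1, g)) != lam(*act(phi2, g)):
                return phi1, phi2, g
    return None

PARAMETERS = {
    "mu": lambda mu, sigma: mu,
    "sigma": lambda mu, sigma: sigma,
    "mu^3": lambda mu, sigma: mu ** 3,
    "mu + sigma": lambda mu, sigma: mu + sigma,
    "mu + 3 sigma": lambda mu, sigma: mu + 3 * sigma,
    "mu + sigma^2": lambda mu, sigma: mu + sigma ** 2,
    "sigma / mu": lambda mu, sigma: sigma / mu,
}

for name, lam in PARAMETERS.items():
    ok_full = find_violation(lam, FULL_GROUP) is None
    ok_scale = find_violation(lam, SCALE_GROUP) is None
    print(f"{name:13s} no violation under full group: {ok_full!s:5s} "
          f"under scale group: {ok_scale}")
```

On this grid the coefficient of variation fails under the full group but shows no violation under the pure scale group, in line with the Principle.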


LEMMA 1. Given a parameter $\lambda^a$, there is always a maximal subgroup $G^a$
of $G$ such that $\lambda^a$ is permissible with respect to $G^a$.


PROOF. Let $G^a$ be the set of all $g \in G$ such that for all $\phi_1, \phi_2 \in \Phi$
we have that $\lambda^a(\phi_1) = \lambda^a(\phi_2)$ if and only if
$\lambda^a(\phi_1 g) = \lambda^a(\phi_2 g)$. Then $G^a$ contains the identity. Furthermore,
using the definition with $\phi_1, \phi_2$ replaced by $\phi_1 g_1, \phi_2 g_1$, it
follows that $g_1 g_2 \in G^a$ when $g_1 \in G^a$ and $g_2 \in G^a$. Using the definition
with $\phi_1, \phi_2$ replaced by $\phi_1 g^{-1}, \phi_2 g^{-1}$, it is clear that it
contains inverses. Hence $G^a$ is a group. It follows from the construction that it is
maximal. □
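
The construction in the proof can be carried out explicitly when everything is finite. The sketch below uses a hypothetical three-point parameter space, the full permutation group as G, and an indicator function as focus parameter; none of these choices come from the paper. It applies the membership condition of the proof to each group element and confirms that the resulting set is closed under composition.

```python
from itertools import permutations

PHI = (0, 1, 2)                  # hypothetical three-point parameter space
G = list(permutations(PHI))      # all permutations; g sends point p to g[p]

def lam(phi):
    """Toy focus parameter: indicator that phi equals 0."""
    return int(phi == 0)

def in_maximal_subgroup(g):
    """Condition from the proof: lam(p1) == lam(p2) iff lam(p1 g) == lam(p2 g)."""
    return all(
        (lam(p1) == lam(p2)) == (lam(g[p1]) == lam(g[p2]))
        for p1 in PHI for p2 in PHI
    )

def compose(g1, g2):
    """Right action convention: apply g1 first, then g2."""
    return tuple(g2[g1[p]] for p in PHI)

G_a = [g for g in G if in_maximal_subgroup(g)]
print(G_a)  # the permutations that keep the point 0 fixed

# Closure under composition, as asserted in the proof.
assert all(compose(g1, g2) in G_a for g1 in G_a for g2 in G_a)
```

In this toy case the subgroup consists of the identity and the exchange of the two points on which the indicator vanishes.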


From this it follows that the group $G^a$ also acts on $\Lambda^a = \lambda^a(\Phi)$, by a
simple homomorphism determined as in (1).


4. Experimental parameters and counterfactuals. In our view this choice 
of experiment can also be related to the literature on causal inference, in particular 
to the concept of counterfactuals, which has a central place there. A counterfactual 
question is a question of the form: “What would the result have been if ...?". 
A counterfactual variable, in the way this concept is used in the literature, is a 
hypothetical variable giving the result of performing an experiment under some 
specific condition a, when this condition a is known not to hold. A typical example
is when several treatments can be allocated to some given experimental unit at 
some fixed time, and then in reality only one of these treatments can be chosen. 




The use of such a concept goes back to Neyman [48], and has in recent decades 
been discussed by, among others, Rubin [54], Robins [52, 53], Pearl [50] and Gill 
and Robins [29]. On the other hand, Dawid [21] is skeptical of an extensive use 
of counterfactuals. The discussion of the last paper shows some of the positions 
taken by several prominent scientists on this issue. 

In our setting, we choose and perform one experiment $a$, and then any other
experiment $b$ imagined at the same time must be regarded as a counterfactual
experiment. However, instead of introducing counterfactual variables, I use
counterfactual parameters $\lambda^b$, which in my view is a more useful concept. Parameters are
hypothetical entities that usually cannot be observed directly. Nevertheless they 
may be useful in our mental modeling of phenomena and in our discussion of 
them. In the last decades, such mental models in causal inference have been devel- 
oped to great sophistication, among other ways by using various graphical tools 
[41, 50]. In the present paper we will limit mental models to scalar and vector pa- 
rameters, some counterfactual, leading to what we have called a total parameter, 
but this model concept can in principle be generalized. 

When it is decided to perform one particular experiment $a \in \mathcal{A}$, the
parameter $\lambda^a$ becomes the parameter of this specific experiment, an experiment
which then also may include a technical or experimental error. In any case, the
experiment will give an estimate $\hat{\lambda}^a$. If the technical error can be
neglected, we have a perfect experiment, implying $\hat{\lambda}^a = \lambda^a$.

We are here at a crucial point for understanding the whole theory of this paper,
namely the transition from the unobserved parameter to the observed variable.
Let us again look at a single patient at some given time who can be given two
different treatments. Define $\lambda^a$ as the expected survival time of this patient
under treatment $a$. Then make a choice of treatment, say $a = 1$. Ultimately, we then
observe a survival time $t^1$ for this patient. There is no technical error involved
here, so we might say that we then have $\lambda^1 = \hat{\lambda}^1 = t^1$. And this is in
fact true. By definition, $\lambda^1$ is connected to the single patient, the definite
treatment time and a definite choice of treatment. So even though $\lambda^1$ is defined
at the outset as an unknown parameter, its definition is such that, once the experiment
is carried out, the parameter must by definition take the value $t^1$.

This simple, but crucial phenomenon, which is related to how a concept can be
defined in a given situation, is in my view of quantum mechanics closely connected 
to what physicists call "the collapse of the wave packet" when an observation is 
undertaken. 


5. A quantum particle with spin. Perhaps the simplest quantum-mechanical
system is an electron with its spin. The spin component $\lambda$ can be
measured in any space direction $a$, and $\lambda$ always takes one of the values $-1$ or
$+1$. Given such a (perfect) measurement, this defines in the usual quantum formalism
a certain state vector $v$ in a complex two-dimensional vector space $H$, formally as
the eigenvector of an operator corresponding to the given measurement with the
given measurement value as eigenvalue. And given this state vector $v$, quantum
mechanics offers formulae, versions of which will be discussed later, for predicting
the results of further measurements. This quantum-mechanical model for the
electron also has several applications to other systems. The setup itself is generally
called a qubit in the literature.

As a contrast to this formalism, and to illustrate the general theory of this paper, 
we give a nonstandard description of a particle with spin, a description which 
will turn out in the end to be essentially equivalent to the one given by ordinary 
quantum theory. 

The total parameter $\phi$ corresponding to electron spin may be defined as a vector
in three-dimensional space; the direction of the vector gives the spin axis, the
norm gives the spinning speed. The associated group $G$ is then the group of all
rotations of this vector in $\mathbb{R}^3$ around the origin. At the outset, $\phi$ is a
model quantity and hence unknown. As indicated before, we will assume throughout that
such a total parameter can never assume a definite value in the sense that it never can
be estimated. Nevertheless, such an abstract quantity turns out to be useful in model
discussions.

Now let the electron have such a total parameter $\phi$ attached to it. Assume first
that the system defines a context such that it is only possible to estimate some
given component of $\phi$. From this point of view, the most that we can hope to be
able to measure is the angular momentum component
$\theta^a(\phi) = \|\phi\| \cos(\alpha)$ in some direction given by a unit vector $a$,
where $\alpha$ is the angle between $\phi$ and $a$.

The function $\theta^a(\cdot)$ is easily seen to be nonpermissible for fixed $a$. This is
simply because two vectors with the same component along $a$ in general will have
different such components after a rotation. The maximal possible choice of the
group $G^a$ with respect to which $\theta^a(\cdot)$ is permissible is the group of
rotations of the unit vector around the axis $a$, possibly together with a
180° rotation around any axis perpendicular to $a$.
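
The nonpermissibility of the component along a fixed direction can be seen in a small numerical experiment. The sketch below takes the measurement direction to be the z-axis and applies rotation matrices from the left to column vectors, which differs from the right-action notation of the text but does not affect the conclusion. The specific vectors, angles and function names are assumptions chosen for illustration.

```python
import numpy as np

def rotation(axis, angle):
    """Rotation matrix about the given axis (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

a = np.array([0.0, 0.0, 1.0])    # measurement direction (assumed: z-axis)

def theta(phi):
    """Component of phi along a, that is |phi| cos(alpha)."""
    return float(phi @ a)

# Two unit vectors with the same component along a.
phi1 = np.array([0.6, 0.0, 0.8])
phi2 = np.array([0.0, 0.6, 0.8])

about_a = rotation(a, 0.7)                 # rotation around the axis a
about_x = rotation([1.0, 0.0, 0.0], 0.7)   # generic rotation, not around a
flip = rotation([1.0, 0.0, 0.0], np.pi)    # 180 degree rotation, axis perpendicular to a

print(theta(about_a @ phi1), theta(about_a @ phi2))  # equal: component preserved
print(theta(about_x @ phi1), theta(about_x @ phi2))  # different: permissibility fails
print(theta(flip @ phi1), theta(flip @ phi2))        # equal: both change sign
```

The first and third pairs stay equal, which matches the description of the maximal subgroup, while the generic rotation separates the two components.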

The group $G^a$ also acts on the image space for $\theta^a$. This group action has
several orbits: For each $\kappa \in (0, 1]$, one orbit is given by the two-point set
$\{-\kappa, \kappa\}$ in the image space of $\theta^a$. In addition there is an orbit for
$\kappa = 0$.

We want in general that any reduction of the parameter space should be to an
orbit or to a set of orbits. Since the value of $\kappa$ may be considered to be arbitrary,
we concentrate on $\lambda^a = \mathrm{sign}(\theta^a)$, taking the two values $-1$ and $+1$.
This also implies that the function $\lambda^a(\phi)$ is permissible with respect to the
group $G^a$, and that this group acts upon $\lambda^a$ by exchanging its two values.
Assume now that the electron in itself defines such a context that only $\lambda^a$ can be
measured, an assumption which is consistent with experience. The apparatus usually used
to measure such a discretized spin component is called a Stern-Gerlach device.

The unconditional prior probability for $\lambda^a$ is 1/2 for each of the values $\pm 1$
by symmetry. Assume now that we know that $\lambda^a = +1$, and that we afterward will
measure the spin component in another direction $b$. We assume for simplicity that
we have an ideal measurement apparatus in the direction $b$, so that what we seek
is the transition probability in parameter space,


$P(\lambda^b = +1 \mid \lambda^a = +1)$.


The formal quantum-mechanical solution of this is well known in the physics
literature. Let the components of the (unit) $a$-vector be $(a_x, a_y, a_z)$, and let
$\sigma_x$, $\sigma_y$ and $\sigma_z$ be the three Pauli spin operators


$\sigma_x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad
 \sigma_y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad
 \sigma_z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$


Calculate the eigenvector $v^a$ for the operator
$a_x \sigma_x + a_y \sigma_y + a_z \sigma_z$ corresponding to the eigenvalue $+1$, and do a
similar thing in the $b$-direction. Then the formalism of quantum mechanics (see
Section 14 below) says that


(4)    $P(\lambda^b = +1 \mid \lambda^a = +1) = |v^{a\dagger} v^b|^2$.

A straightforward calculation then gives

(5)    $P(\lambda^b = +1 \mid \lambda^a = +1) = (1 + \cos(u))/2$,


where $u$ is the angle between the $a$-vector and the $b$-vector.
A general statistical approach to transition probabilities is given in Theorem 5
below.
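
The step from (4) to (5) can be verified numerically. The sketch below builds the operator from the standard Pauli matrices, extracts the eigenvector for eigenvalue +1 with NumPy's Hermitian eigensolver, and compares the squared overlap with (1 + cos(u))/2. The function names and the chosen directions are illustrative assumptions.

```python
import numpy as np

SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
SIGMA_Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
SIGMA_Z = np.array([[1, 0], [0, -1]], dtype=complex)

def spin_up_vector(direction):
    """Unit eigenvector of a_x sigma_x + a_y sigma_y + a_z sigma_z for eigenvalue +1."""
    ax, ay, az = np.asarray(direction, dtype=float) / np.linalg.norm(direction)
    operator = ax * SIGMA_X + ay * SIGMA_Y + az * SIGMA_Z
    eigenvalues, eigenvectors = np.linalg.eigh(operator)  # Hermitian eigensolver
    return eigenvectors[:, np.argmax(eigenvalues)]        # column for eigenvalue +1

def transition_probability(a, b):
    """Right-hand side of (4): squared modulus of the inner product of v^a and v^b."""
    va, vb = spin_up_vector(a), spin_up_vector(b)
    return float(abs(np.vdot(va, vb)) ** 2)   # vdot conjugates its first argument

a = np.array([0.0, 0.0, 1.0])
b = np.array([np.sin(1.1), 0.0, np.cos(1.1)])   # unit vector at angle 1.1 rad from a
u = np.arccos(np.clip(a @ b, -1.0, 1.0))

print(transition_probability(a, b), (1 + np.cos(u)) / 2)  # the two numbers agree
```

The arbitrary phase of each eigenvector returned by the solver does not matter here, since only the modulus of the inner product enters (4).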


6. Parameters of several statistical experiments. Up to now, we have as- 
sumed the existence of a total parameter. This section gives a very general alterna- 
tive way to arrive at this concept. 

Consider a set $\mathcal{A}$ of mutually exclusive experiments, each of the ordinary
statistical kind, but we will concentrate on the parameter spaces $\Lambda^a$;
$a \in \mathcal{A}$. The whole set of parameters of the experiments is given by points in
the big space


$\Pi = \times_{a} \Lambda^a$,


a Cartesian product. If all parameter spaces have the same structure $\Lambda$, this can be
considered to be the set of functions from $\mathcal{A}$ to $\Lambda$.
Let there be defined a transformation group $G$ on $\Pi$.


EXAMPLE 7 (Compare Example 5). Let $\pi = (\lambda^1, \lambda^2)$, where $\lambda^1$ and
$\lambda^2$ are the expected lifelengths of a single patient under two mutually exclusive
treatments. Let $G$ be the joint set of time scale transformations together with the
exchange $\lambda^1 \leftrightarrow \lambda^2$.


EXAMPLE 8. Consider again the electron spin. Let
$\pi = (\lambda^a;\ a \in \mathcal{A})$, where $\lambda^a$ is the spin component $\pm 1$ of a
perfect measurement in the direction $a$ of an electron. Let $G$ be the group generated
by the transformations:

(i) Inversions: $\lambda^a \mapsto -\lambda^a$.
(ii) Rotations of experiments: If $a \mapsto ao$ under a rotation $o$, replace each
$\lambda^a$ with $\lambda^{ao}$. This gives a permutation within the cartesian product.


Note in general that the points of $\Pi$ make sense mathematically, but not directly
physically, hence it does not make sense in a physical context to give values to the
individual points of this space. The space $\Pi$ will hence not be called a state space.

So what operations are meaningful with the spaces $\Pi$? I have mentioned group
operations. One can also adjoin such spaces corresponding to different systems,
and adjoin $\pi$ with some other parameter. Finally, one can look at subspaces.

Assume that the experiments are related in some way. Then it may be reasonable
to try to reduce the space $\Pi$. The purpose of this reduction may be to achieve
parsimony. This should not be thought of as an approximation, however, but may 
be a result of some physical theory. Note that theories are formulated not in terms 
of observations, but in terms of parameters, the theoretical language behind obser- 
vations. 

Let $\Pi$ be reduced to a subspace $\Psi$ with the property:


PROPERTY 1. $\Psi$ is an orbit, that is, a set of the form $\{\pi : \pi = \pi_0 g;\ g \in G\}$, or
a set of orbits for the group G. Use the notation G also for this group acting on $\Psi$.


This is a necessary condition in order that G should be a transformation group 
on the reduced space. It is also consistent with the discussion elsewhere in this 
paper. In [34] there are given several examples of model reductions connected to 
single experiments where the reduced space is an orbit or a set of orbits of an 
associated transformation group. 

It is natural in certain situations to demand also: 


PROPERTY 2. Each section $\{\pi \in \Pi : \lambda^a(\pi) = \lambda_0\}$ has a nonempty intersection
with $\Psi$ for a set of specified values $\lambda_0$.


In fact, this will always be true for some values $\lambda_0$. In a future publication we
hope to use this fact together with some group representation theory to discuss 
quantization itself. 

Let now the model reduction be associated with some function $\phi$ on $\Pi$ which
is one-to-one on the subset $\Psi$ and undefined elsewhere. It follows then from
Property 1 that the group G is well defined on the range of $\phi$.


DEFINITION 2. If such a function exists, call $\Phi = \phi(\Psi)$ the total parameter
space. Any function with the above properties is called a total parameter.


A total parameter $\phi$ can in principle be replaced by any other total parameter in
one-to-one correspondence with $\phi$. But it is important to have a simple
representation.

If Property 2 holds, then each $\lambda^a$ can be regarded as a function on $\Phi$.




EXAMPLE 8 (continued). Restrict $\Pi$ to the subset $\Psi$, the set of all $\pi$ such that
there exists a vector $\phi$ that gives each $\lambda^a$ equal to $\mathrm{sign}(a \cdot \phi)$. Let $\phi(\pi)$ be this
direction normed as a unit vector.


- Taken as a unit vector, $\phi(\pi)$ is a unique function of $\pi$.


PROOF. Suppose that there is a $\pi$ which corresponds to two different unit
vectors $\phi_1$ and $\phi_2$. Then $a = \phi_1 - \phi_2$, normalized, gives $\lambda^a = +1$ corresponding
to $\phi_1$ and $\lambda^a = -1$ corresponding to $\phi_2$, a contradiction. □


- The set $\Psi$ is an orbit of G.

PROOF. It is easy to see that $\Psi$ is closed under inversions and rotations. □

- All sections $\{\pi : \lambda^a(\pi) = \pm 1\}$ have nonempty intersections with $\Psi$.

PROOF. Obvious. □

From this, we are back to the situation discussed in Section 5. 
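
The uniqueness argument in Example 8 (continued) can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: vectors are plain 3-tuples, the function names `spin_component` and `separating_direction` are hypothetical, and the boundary case $a \cdot \phi = 0$ (a set of measure zero) is arbitrarily assigned the value -1.

```python
import math


def spin_component(a, phi):
    """lambda^a(phi) = sign(a . phi), with values +1 or -1.

    Assumption: a . phi == 0 is mapped to -1; the paper does not discuss it.
    """
    dot = sum(x * y for x, y in zip(a, phi))
    return 1 if dot > 0 else -1


def separating_direction(phi1, phi2):
    """The direction a = (phi1 - phi2) / |phi1 - phi2| used in the proof."""
    diff = [x - y for x, y in zip(phi1, phi2)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]
```

For two distinct unit vectors, `separating_direction` returns a direction a with $a \cdot \phi_1 > 0$ and $a \cdot \phi_2 < 0$, so the two vectors cannot generate the same point $\pi$.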


7. Experiment, model reduction and group representation. Now let the
experimentalist have the choice between different experiments $a \in \mathcal{A}$ on the same
unit(s), where the experiment a consists of measuring some $y^a$, with $y^a = y^a(\omega)$
being a function on some sample space S, and where the measurement process is
modeled with a parameter $\lambda^a$. This parameter is a part of the model description of
the units, and all the model parameters may be seen as functions $\lambda^a(\phi)$ of a total
parameter $\phi$.

We use a common sample space S for all experiments a, since this space can be 
imagined in terms of a common measurement apparatus or some set of apparatus. 
Specifically we assume: 


ASSUMPTION 3. There is a common sample space S. The reduced model
probability measures $P^{\lambda^a}$ are jointly dominated, that is, absolutely continuous with
respect to a fixed probability measure P on the sample space S.


In the electron case this simply means that one in principle can assume that the 
same or the same kind of Stern-Gerlach apparatus can be used for every measure- 
ment. The measure P can be assumed to be Bernoulli(1/2). 

In the previous section, a global model reduction was introduced by reducing
the large space $\Pi$ to one or a few orbits of the basic group G. As in the electron spin
example, it may also be natural or necessary to reduce the original parameter $\theta^a$
to a new parameter $\lambda^a$. All such model reduction is done by selecting one or a few
orbits of the relevant group $G^a$.




The most important theoretical argument for model reduction associated with 
orbits of the group is the following: All models should have a parameter space 
which is invariant under the group. For the reduced model this is only possible 
when the parameter space in question is composed of orbits of the relevant group. 

Here is another argument: The Pitman estimator is equal to the Bayes estimator 
under right invariant prior, and this estimator is important in many applications. In 
order that this shall make sense for the reduced model, the parameter space of this 
reduced model must be constructed from orbits of the parameter group actions. 

A further discussion of model reduction under symmetry in statistics and in 
quantum mechanics will be given elsewhere, and we then also hope to relate the 
discussion to the concept of group representation, which is very useful in quantum 
theory. 

Generally (see also Appendix A.2), a group representation is a class of operators
$\{U(g);\ g \in G\}$ on a vector space V, where G is a group, such that the
operators satisfy the property $U(gh) = U(g)U(h)$. This gives a group of operators
homomorphic to the group G, and, as the name says, it is used to represent the 
group in a specific way. There is a large mathematical literature on group repre- 
sentations. 

Specifically, the regular representation U(G) on $L^2(\Phi, \nu)$, where $\nu$ is a right
invariant measure for the basic group G, is given by


(6) $U(g)f(\phi) = f(\phi g).$


Explicitly, this implies that U(G) is a group of linear operators acting on $L^2(\Phi, \nu)$.
The group property of U(G) is well known and easily verified. The same
formula (6) is valid for any subspace V of $L^2(\Phi, \nu)$ which is invariant under the
group of operators U(G), that is, such that $U(g)f \in V$ when $f \in V$ and $g \in G$.

We will also consider group representation spaces of the group $G^a$ acting on $\Phi$.
Let $\lambda^a$ be a permissible function of $\phi$. Then


$V_\lambda^a = \{f \in L^2(\Phi, \nu) : f(\phi) = \tilde f(\lambda^a(\phi))\}$


is an invariant subspace of $L^2(\Phi, \nu)$ under the regular representation $U(G^a)$.
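
The group property of the regular representation can be checked by brute force in a finite toy case. The sketch below is only an illustration under assumptions that are not in the paper: the total parameter space is replaced by three points, the group by the symmetric group on these points acting from the right, and the invariant measure by the counting measure. The names `act`, `compose`, `U` and `is_homomorphism` are hypothetical.

```python
from itertools import permutations

import numpy as np

POINTS = range(3)
# Each group element g is a tuple; the point p is sent to p g = g[p].
GROUP = list(permutations(POINTS))


def act(p, g):
    """Right action of g on the point p."""
    return g[p]


def compose(g, h):
    """Product gh for a right action: p (gh) = (p g) h."""
    return tuple(h[g[p]] for p in POINTS)


def U(g):
    """Matrix of U(g) on functions f over POINTS: (U(g) f)(p) = f(p g)."""
    matrix = np.zeros((3, 3), dtype=int)
    for p in POINTS:
        matrix[p, act(p, g)] = 1
    return matrix


def is_homomorphism():
    """Check U(gh) = U(g) U(h) for every pair of group elements."""
    return all(
        np.array_equal(U(compose(g, h)), U(g) @ U(h))
        for g in GROUP
        for h in GROUP
    )
```

With a right action the order of the factors is preserved, which is why the formula $U(g)f(\phi) = f(\phi g)$ gives a homomorphism rather than an anti-homomorphism.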


8. Experimental basis and the Hilbert space of a single experiment. Up to
now the discussion has been largely in terms of models and abstract parameters. 
Now we introduce observations in more detail. We have already stressed that in 
a given situation we have a choice between different experiments/questions a. In 
this section we give a general discussion fixing this experiment, and hence fixing 
the parametric function $\lambda(\phi)$. Given a measurement instrument, this will lead to a
statistical model $P^\lambda$.

In this section we will need to introduce some statistical concepts; for a more 
thorough treatment, see, for example, [42]. 

We use the ordinary concept of sufficiency, repeated for convenience: 




DEFINITION 3. A random variable $t = t(\omega)$; $\omega \in S$ connected to a model $P^\lambda$
is called sufficient if the conditional distribution of each other variable y, given t,
is independent of the parameter $\lambda$.


A sufficient statistic t is minimal if it is a function of every other sufficient
statistic. It is complete if


(7) $E^\lambda(h(t)) = 0$ for all $\lambda$ implies $h(t) = 0$.


It is well known that a minimal sufficient statistic always exists and is unique
except for invertible transformations, and that every complete sufficient statistic is
minimal. If the statistical model has a density belonging to an exponential class


$b(y)\,d(\lambda)\,e^{c(\lambda)' t(y)}$


and if $c(\Lambda) = \{c(\lambda) : \lambda \in \Lambda\}$ contains some open set, then the statistic t is complete
sufficient.

Recall that a function $\xi(\lambda)$ is called unbiasedly estimable if $E^\lambda(y) = \xi(\lambda)$ for
some y. Given a complete sufficient statistic t, every unbiasedly estimable function
$\xi(\lambda)$ has one and only one unbiased estimator that is a function of t. This is the
unique unbiased estimator with minimum risk under weak conditions [42]. Thus
complete sufficiency leads to efficient estimation.
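
To make the completeness condition (7) concrete, here is a small Python sketch for a model that is not taken from the paper: t is assumed to be the number of successes in n Bernoulli trials with success probability p. Then $E^p(h(t)) = \sum_k h(k) P_p(t = k)$ is linear in the vector $(h(0), \ldots, h(n))$, and (7) holds as soon as the matrix of probabilities for n + 1 distinct values of p has full rank. The function names are hypothetical.

```python
from math import comb

import numpy as np


def binomial_pmf_matrix(n, probs):
    """Row j holds P_p(t = k), k = 0..n, for the success probability p = probs[j]."""
    return np.array(
        [[comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)] for p in probs]
    )


def zero_is_only_unbiased_estimator_of_zero(n, probs):
    """True if E^p(h(t)) = 0 for all p in probs forces h = 0 (full column rank)."""
    matrix = binomial_pmf_matrix(n, probs)
    return bool(np.linalg.matrix_rank(matrix) == n + 1)
```

Full rank for one finite set of parameter values already implies (7) for the whole model, since the condition in (7) is required for all values of the parameter.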


ASSUMPTION 4. For each $a \in \mathcal{A}$ the experiment can be chosen in such a way
that there is a complete sufficient statistic $t^a$ under the model $P^{\lambda^a}$.


For the rest of this section we fix such an experiment and drop the index a. We
write D for $G^a$, which will be a fixed group on the common sample space S, but
also acts on the selected parameter space.


DEFINITION 4. The Hilbert space K is defined as the set of all functions h(t)
such that $h(t) \in L^2(S, P)$ and $f(\phi) = E^{\lambda(\phi)}(h(t)) \in L^2(\Phi, \nu)$.


In this definition the function h is assumed to be complex-valued. It is easy
to see that (7) holds for complex functions if and only if it holds for real-valued
functions.

A sufficient condition for $f \in L^2(\Phi, \nu)$ is that $\int E^{\lambda(\phi)}(|h(t)|^2)\,\nu(d\phi) < \infty$.
Since it is defined as a closed subspace of a Hilbert space, the Hilbert space
property of K is seen to hold.

Let then the group D be acting upon the sample space S, on the parameter
space $\Lambda$ and on the total parameter space $\Phi$. Recall the brief discussion of group
representations in Section 7. In particular, recall the definition of the space $V_\lambda$, an
invariant space under the regular representation of the group D on $L^2(\Phi, \nu)$.




PROPOSITION 1. Each space K is an invariant space for the regular
representation of the observational group D on $L^2(S, P)$, that is, under $\tilde U(g)h(t) =
h(tg)$; $g \in D$.


PROOF. If t is sufficient under the model $P^\lambda$, and D is the group acting on
the sample space, then tg given by $(tg)(\omega) = t(\omega g)$ is sufficient for all $g \in D$.
This is proved by a simple exercise using (2). Also, if t is complete, then tg must
be complete; hence the two must be equivalent. The norm conditions are easy to
verify. Therefore K is invariant under D. □


Consider now the operator A from K to $V_\lambda \subset L^2(\Phi, \nu)$ defined by

(8) $(Ay)(\lambda) = \int y(\omega)\,P^\lambda(d\omega) = E^\lambda(y),$


using again the (reduced) model $P^\lambda(d\omega)$ corresponding to the experiment a. In the
following it will be important to use K to construct a Hilbert space related to the
parameter space.


DEFINITION 5. Define the space L by L = AK.


By the definition of a complete sufficient statistic, the operator A will have a
trivial kernel as a mapping from K onto AK. Hence this mapping is one-to-one. It
is also continuous and has a continuous inverse. (See below.) Hence L is a closed
subspace of $L^2(\Phi, \nu)$, and therefore a Hilbert space. Note also that L is the space
in $L^2(\Phi, \nu)$ of unbiasedly estimable functions with estimators in $L^2(S, P)$. It is in
general included in the space $V_\lambda$ of all functions of the parameter $\lambda$.


PROPOSITION 2. The space L is an invariant subspace of $L^2(\Phi, \nu)$ for the
regular representation of the group D on $L^2(\Phi, \nu)$.


PROOF. Assume that $\xi(\lambda) = E^\lambda(y)$ is unbiasedly estimable. Then also $\eta(\lambda) =
\xi(\lambda g) = E^{\lambda g}(y) = E^\lambda(yg)$ is unbiasedly estimable, so L is an invariant space under
the regular representation U of D, defined by $U(g)f(\lambda) = f(\lambda g)$. □


A main result is now:

THEOREM 1. The spaces $K \subset L^2(S, P)$ and $L \subset L^2(\Phi, \nu)$ are unitarily
related. Also, the regular representations of the group D properly defined on these
spaces are unitarily related.


PROOF. We will show that the mapping A can be replaced by a unitary map
in the relation L = AK.




Recall that the connection from the observation group to the parameter group
D is given from the model by


(9) $P^{\lambda g}(B) = P^\lambda(Bg^{-1}); \quad g \in D.$


Using the definition (8) and the connection (9), we find the following
relationships. We assume that the random variable $y(\cdot)$ belongs to $K \subset L^2(S, P)$ and that
U is chosen as a representation on the invariant space L. Then


(10) $U(g)Ay(\lambda) = \int y(\omega)\,P^{\lambda g}(d\omega) = \int y(\omega)\,P^\lambda(d\omega g^{-1}) = \int y(\omega g)\,P^\lambda(d\omega) = A\tilde U(g)y(\lambda),$


where $\tilde U$ is the representation on K given by $\tilde U(g)y(\omega) = y(\omega g)$, that is, the regular
representation on $L^2(S, P)$ restricted to this space.

Thus $U(g)A = A\tilde U(g)$ on K.

Hence


$U(g) = A\tilde U(g)A^{-1}; \quad g \in D.$


Recall that the action of D on $\Lambda$ is defined by $(\lambda g)(\phi) = \lambda(\phi g)$, and that U(g)
agrees with the regular representation on $V_\lambda$, where $U(g)f(\phi) = f(\phi g)$ when
$f \in V_\lambda$ and $g \in D$.

By Naimark and Stern ([47], page 48), if two representations of a group are
equivalent, they are unitarily equivalent. (The result there is formulated for the
finite-dimensional case, but the proof is valid in general.) Hence for some
unitary C we have


(11) $U(g) = C\tilde U(g)C^{\dagger}.$


Since the unitary operators in this proof are defined on K and L, respectively, it
follows that these spaces are related by L = CK.

Definition 4 may also be coupled to the operator A and to an arbitrary Hilbert
space K' of sufficient statistics, which for instance may be the whole space
$L^2(S, P)$. First let


(12) $M = \{y \in K' : E^\lambda y = 0 \text{ for all } \lambda\}.$


Then K may be considered as the factor space K'/M, that is, the equivalence
classes of the old K' with respect to the linear subspace M (cf. [47], I.2.10IV).

Here is a proof of this fact: Let $\xi \in AK'$, such that $\xi(\lambda) = E^\lambda(y)$ for some
$y \in K'$. Then y is an unbiased estimator of the function $\xi(\lambda)$. By Lehmann and
Casella ([42], Lemma 1.10), $\xi(\lambda)$ has one and only one unbiased estimator which
is a function h(t) of t. Then every unbiased estimator of $\xi(\lambda)$ is of the form y =
h(t) + x, where $x \in M$; this constitutes an equivalence class. On the other hand,
every h(t) can be taken as such a y. □




9. The parametric Hilbert space of a selected experiment. Return to the
situation where one selects an experiment a among a class of experiments $\mathcal{A}$.
Corresponding to this choice we now have a parametric Hilbert space $L^a$ and an
observational Hilbert space $K^a$. This models a certain measurement apparatus, and
in many cases one would expect that the parameter space, and hence the space $L^a$,
will represent some intrinsic property of nature, and therefore be independent of
the choice of measurement apparatus.

However, to cover all cases, and to get a unique definition, we will define the
parametric Hilbert space connected to question $a \in \mathcal{A}$ through a special choice of
measurement apparatus.


DEFINITION 6. (i) Before any experiment is done, $\lambda^a$ is just the name of some
parameter. After the experiment, we have some estimate $\hat\lambda^a$ of this parameter. The
experiment is called perfect if experimental error can be neglected, so that $\hat\lambda^a$ is
the realized value of the parameter in this experiment.

(ii) Define the Hilbert space $H^a$ connected to question $a \in \mathcal{A}$ as the space $L^a$
for a perfect experiment with parameter $\lambda^a$.


One remark is that even in the perfect case it may be important to distinguish 
between a parameter and its realized value. In the electron spin case, a perfect 
measurement means simply that the Stern—Gerlach apparatus functions without 
any error. 

We will see later that under natural assumptions a nonperfect experiment may
be related to the same space $H^a$.


PROPOSITION 3. With the above definitions the space $H^a$ is just the space $V_\lambda^a$
of functions f of $\lambda^a(\cdot)$ such that $f(\phi) = \tilde f(\lambda^a(\phi)) \in L^2(\Phi, \nu)$.


PROOF. If $\tilde f$ is arbitrary and the experiment is perfect, then $\int |\tilde f(\hat\lambda^a)|^2\,dP =
|\tilde f(\lambda^a(\phi))|^2$ is finite. The result then follows from Definitions 4, 5 and 6. □


As an example, in the electron spin case, the total parameter $\phi$ is the spin vector
and $L^2(\Phi, \nu)$ corresponds to a measure $\nu$ which is uniform on any shell, and where
any measure on $|\phi|$ can be used. Let $\lambda^a(\phi) = \mathrm{sign}(a \cdot \phi)$. Then $H^a$ is simply the
space of functions of $\lambda^a(\phi)$, a two-dimensional space. Specifically, $H^a$ is the space
of functions of $\phi$ which are constant on the two half-spaces separated by a plane
through the origin perpendicular to the vector a.

All this indicates that our discussion could have been simplified by concentrat- 
ing on the parameter space. Our reasons for nevertheless giving a full treatment 
involving the sample space have been given in the Introduction. 




10. The quantum-theoretical Hilbert space. Our task in this section is to 
tie the spaces $H^a$ together. Our essential point of departure here is that the parameter
spaces of the different experiments have a similar structure. Then it is
not unreasonable to assume that they can be transformed over to each other by 
some element of the basic group G. This will not give the most general case of 
the quantum-mechanical formalism, but gives a treatment which includes qubits, 
higher spins, several particles and the most important cases of entanglement, a 
phenomenon which is much discussed in the quantum-mechanical literature. 


ASSUMPTION 5. For each pair of experiments $a, b \in \mathcal{A}$ there is an
element $g_{ab}$ of the basic group G which induces a correspondence between the
respective parameters,


(13) $\lambda^b = \lambda^a g_{ab}$ or $\lambda^b(\phi) = \lambda^a(\phi g_{ab}).$


This assumption is fairly strong, and it makes the task of connecting the spaces 
really simple. On the other hand, it seems to be satisfied in concrete cases. The 
same assumption will be needed in Section 12. 

In the electron spin case $\Phi$ was a space of vectors, and G was the rotation group
together with changes of scale. Then (13) holds if $g_{ab}$ is any rotation
transforming a to b.

If (13) holds for transformations on some component spaces, it also holds for 
the cartesian product of these spaces when the relevant cartesian product of groups 
is used. 

Another interesting relation is connected to Assumption 5 in the following way:
(13) implies that one ought to have $\lambda^b g^b = \lambda^a g^a g_{ab}$ for some $g^a \in G^a$. Hence it
follows that $\lambda^a g_{ab} g^b = \lambda^a g^a g_{ab}$, so $g^a$ and $g_{ab} g^b g_{ab}^{-1}$ act in the same way on $\lambda^a$.
One can give many examples of group transformations where $g^a = g_{ab} g^b g_{ab}^{-1}$
holds in general, giving an isomorphism between the groups $G^a$ and $G^b$.

Assumption 5 will be crucial in connecting the Hilbert spaces $H^a$ for the
different experiments. First, from the construction of the Hilbert spaces, $H^a$ is a space
of functions of $\lambda^a(\phi)$, and $H^b$ is a space of functions of $\lambda^b(\phi)$. Furthermore, the
spaces are constructed in the same way. Specifically, if $f^a(\phi) = \tilde f(\lambda^a(\phi))$ and
$f^b(\phi) = \tilde f(\lambda^b(\phi))$, then by (13) we have


(14) $f^b(\phi) = f^a(\phi g_{ab}) = U(g_{ab}) f^a(\phi).$

This implies:

THEOREM 2. (a) There is a connection between the spaces $H^a$ and $H^b$ given
by

(15) $H^b = U(g_{ab})H^a.$




(b) There exist a Hilbert space H and, for each $a \in \mathcal{A}$, a unitary
transformation $E^a$ such that $H^a = E^a H$.

(c) For any experiment satisfying Assumption 4 and such that the parametric
Hilbert space $L^a$ is equal to $H^a$, there are unitary transformations $F^a$ such that
the observational Hilbert spaces satisfy $K^a = F^a H$.


PROOF. (a) Proved above.
(b) Obvious from (15). The space H can be chosen as any fixed $H^c$.
(c) From (a) and Theorem 1. □


Now introduce: 


ASSUMPTION 6. The group G is the smallest group containing all the
subgroups $G^a$.


From this we get: 


THEOREM 3. H is an invariant space for some abstract representation W of
the whole group G.


PROOF. It follows from Proposition 2 that $H^a$ is an invariant space for the
group $G^a$.
This can now be extended. Observe first that

(16) $W(g_1 g_2 g_3) = E^{a\dagger} U^a(g_1) E^a\, E^{b\dagger} U^b(g_2) E^b\, E^{c\dagger} U^c(g_3) E^c$


gives a representation on H of the set of elements in G that can be written as a
product $g_1 g_2 g_3$ with $g_1 \in G^a$, $g_2 \in G^b$ and $g_3 \in G^c$.

Continuing in this way, using Assumption 6, implying that the group G is
generated by $\{G^a;\ a \in \mathcal{A}\}$, we are able to construct a representation W of the whole
group G on the space H. In particular, one is able to take H as an invariant space
for a representation of this group. □


As an example, the two-dimensional Hilbert space of a particle with spin is 
always an (irreducible) invariant space for the rotation group. This determines to a 
large extent H, if we in addition assume H to be as small as possible. In general, 
the requirement that H should be a representation space for G may put a constraint 
on the dimension of H. 

The construction above gives a concrete representation of the quantum- 
mechanical Hilbert space. Since all Hilbert spaces of the same dimension are uni- 
tarily equivalent, other representations—or just an abstract representation—may 
be used in practice. This is sufficient to give the Born formula as proved below, 
and through this the ordinary quantum formalism. But the concrete representation 
facilitates interpretation. 




For our construction, the unitary connection (15) between the Hilbert spaces
for single experiments is the most important premise. This can easily also be
related to the space-time issue. Say, let $\xi$ be the theoretical position, $\pi$ the theoretical
momentum, and let $H^1$ and $H^2$ be the corresponding $L^2$-spaces of parametric
functions. Then we can consider the unitary transformation from $H^1$ to $H^2$ given for
some constant $\hbar$ by


$f_2(\pi) = \frac{1}{\sqrt{2 \cdot 3.14\,\hbar}} \int e^{i\pi\xi/\hbar} f_1(\xi)\,d\xi,$


and in this way introduce a common Hilbert space. This can be connected to the
relevant group, namely the group of space translations together with the Lorentz
group, and it can be argued that $\hbar$ should be a universal constant. This will be
further discussed in [36]. From physics it is known that $\hbar = 1.055 \cdot 10^{-34}$ Js.


11. Operators and states. So, by what has just been proved, for each a the 
Hilbert space $H^a$ of unbiasedly estimable functions of $\lambda^a$ can be put in unitary
correspondence with a common Hilbert space H. From now on we shall make an 
assumption which is common in elementary quantum mechanics, but which is very 
restrictive from a statistical point of view. 


ASSUMPTION 7. Each reduced (maximal) parameter $\lambda^a$ takes only a finite or
denumerably infinite number of values $\lambda_k^a$.


LEMMA 2. These values can be arranged such that each $\lambda_k^a = \lambda_k$ is the same
for all a (k = 1, 2, ...).


PROOF. By Assumption 5

$\{\phi : \lambda^b(\phi) = \lambda_k^b\} = \{\phi : \lambda^a(\phi g_{ab}) = \lambda_k^b\} = \{\phi : \lambda^a(\phi) = \lambda_k^b\}\, g_{ba}.$


The sets in brackets on the left-hand side here are disjoint with union $\Phi$. But
then the sets on the right-hand side are disjoint with union $\Phi g_{ab} = \Phi$, and this
implies that $\{\lambda_k^b\}$ gives all possible values of $\lambda^a$. □


In spite of Lemma 2, since in any statistical model a parameter can be changed
to any one-to-one function of it, we may sometimes use the notation $\lambda_k^a$ in order to
have the most general treatment.

In the finite case Assumption 7 implies that $G^a$, as acting upon $\Lambda^a$, is a group of
permutations, and that the corresponding invariant measure is the counting
measure.

Recall that the Hilbert space H is chosen as one fixed space $H^c$. In this space
let $f_k^c(\phi)$ be defined as the trivial function which equals 1 when $\lambda^c(\phi) = \lambda_k$,
otherwise 0. These are eigenfunctions of the operator $S^c$ defined by $S^c f(\phi) =
\lambda^c(\phi) f(\phi)$. In a different space $H^a$ these functions correspond to $f_k^a(\phi) =
f_k^c(\phi g_{ca}) = U(g_{ca}) f_k^c(\phi)$. Now define vectors in H by


(17) $v_k^a = W(g_{ca}) f_k^c,$


where W is the representation defined by (16). These are eigenvectors of the
self-adjoint operator $T^a = W(g_{ca}) S^c W(g_{ac})$ with eigenvalues $\lambda_k$.

An eigenvector $v_k^a$ represents the statement that the parameter $\lambda^a$ has been
measured with a perfect measurement that has given the value $\lambda_k$.

In general it is not true that all unit vectors of H can be given such an
interpretation. Among other things one has to take into account what are called
superselection rules: For an absolutely conserved quantity $\mu$, the linear combinations
of eigenvectors corresponding to different eigenvalues of the operator associated
to $\mu$ are not possible state vectors. Superselection rules are well known among
physicists, but they are not always stressed in textbooks in quantum mechanics.

In [35], Theorem 6 and Lemma 2, we proved the following under the assumption
that the unitary group generated by {W(g)} and the phase factors is transitive on
the component spaces $H_i$ below:


THEOREM 4. There is a decomposition of H of the form $H_1 \oplus H_2 \oplus \cdots$,
where each $H_i$ is an irreducible invariant space under the group G. Assume that
the unitary group generated by {W(g)} and the phase factors $e^{i\gamma}$ is transitive on
each component $H_i$. Then all unit vectors of each $H_i$ are unitarily equivalent to
some $f_k^a$, an indicator of an event $\lambda^a = \lambda_k^a$. On the other hand, if two such
indicators, say $f_k^a$ and $f_j^b$, are unitarily equivalent to the same $v \in H_i$, and the relevant
unitary transformation can be considered as a subrepresentation of the regular
representation, then there is a one-to-one function F such that $\lambda^b = F(\lambda^a)$ and
$\lambda_j^b = F(\lambda_k^a)$.


In simple terms a state is characterized by the fact that a (maximal) perfect
measurement is performed, and this has led to some value of the corresponding
maximal parameter. Concretely: A perfect experiment $a \in \mathcal{A}$ has led us to
consider the Hilbert space $H^a$, and the result $\lambda^a = \lambda_k$ is exactly characterized by the
indicator function $f_k^a$. Translated to the H-space, the state given by the information
$\lambda^a = \lambda_k$ is then characterized by the vector $v_k^a$.


COROLLARY 1. Under the assumptions of Theorem 4, all unit vectors of each
irreducible space $H_i$ can be taken as state vectors with the following
interpretation: A question $a \in \mathcal{A}$ (or more precisely: What is the value of $\lambda^a$?) has been
asked, and the answer is given by the realized value $\hat\lambda^a = \lambda_k$, or in other words:
A perfect measurement corresponding to the reduced parameter $\lambda^a$ has been
performed, and the result is $\lambda^a = \lambda_k$.




This is consistent with the well-known quantum-mechanical interpretation of a 
state vector. In our treatment, this interpretation of a state as a question-answer 
pair is crucial. 

The operator $T^a$ may be written


(18) $T^a = \sum_k \lambda_k v_k^a v_k^{a\dagger}.$

These operators are self-adjoint, and they satisfy the trivial relation
$v_k^{a\dagger} T^a v_k^a = \lambda_k$.
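
The spectral form of the operator in (18) is easy to reproduce numerically. The following Python sketch is only an illustration under assumptions not made in the paper: it uses a two-dimensional space, the spin direction along the x-axis, and a hypothetical helper `spectral_operator`; eigenvectors are passed as the columns of a matrix.

```python
import numpy as np


def spectral_operator(eigenvalues, eigenvectors):
    """T = sum_k lambda_k v_k v_k^dagger for orthonormal columns v_k."""
    dim = eigenvectors.shape[0]
    operator = np.zeros((dim, dim), dtype=complex)
    for lam, v in zip(eigenvalues, eigenvectors.T):
        operator += lam * np.outer(v, v.conj())
    return operator


# Eigenvectors for spin along the x-axis (columns), eigenvalues +1 and -1.
V_X = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T_X = spectral_operator([1, -1], V_X)
```

In this example `T_X` coincides with the Pauli operator for the x-direction, and `np.vdot(v, T_X @ v)` returns the corresponding eigenvalue for each column v of `V_X`.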

Using the results of this section to construct the joint state vector for a system
consisting of several partial systems, with symmetries only within the partial
systems, one follows the recipe $v = v_{i_1}^{a_1} \otimes v_{i_2}^{a_2} \otimes v_{i_3}^{a_3}$, where it is assumed that
system k is in state $\lambda^{a_k} = \lambda_{i_k}$ for k = 1, 2, 3. By time development under
interaction, as described by the Schrödinger equation, or by other means, other,
entangled, multicomponent states will occur. This will be further discussed in [36] and
elsewhere.


12. Born's formula. We have now obtained a statistical interpretation of the 
quantum-mechanical Hilbert space: Under the assumptions of Theorem 4 all vec- 
tors in that space can be equivalently characterized as question-answer pairs and, 
furthermore, the Hilbert space is invariant under a suitable representation of the 
basic group G. 

To complete the derivation of the formalism of quantum mechanics from the sta- 
tistical parameter approach, the most important task left is to arrive at the Born for- 
mula, which gives the probability of transition from one state to another. The fact 
that such a formula exists is amazing, and must be seen as a result of the symme- 
try of the situation together with the limitation imposed by the Hilbert space. Even 
though I use a different approach, my own result is related to recent attempts to link 
the formula to general decision theory: An interesting development which goes in 
this direction was recently initiated by Deutsch [22]. The approach of Deutsch has 
been criticized by Finkelstein [26], by Barnum et al. [8] and by Gill [28], who gave 
a constructive set of arguments using three reasonable assumptions. 

In this section I will concentrate on the case with one irreducible component 
in the Hilbert space, that is, I will neglect superselection rules. This is really no 
limitation, since transitions between different components are impossible. 

What I am going to prove is a result connecting two different perfect
experiments in the same system. Assume that we know from the first perfect experiment
that $\lambda^a = \lambda_k$. Next assume that we perform another perfect experiment $b \in \mathcal{A}$. In
both cases, the notion of perfect measurement means that measurement error can
be neglected. More realistic experiments are treated in Theorems 7 and 8 below.
In the perfect case it turns out that we can find a formula for


$P(\hat\lambda^b = \lambda_i \mid \lambda^a = \lambda_k) = P(\lambda^b = \lambda_i \mid \lambda^a = \lambda_k)$




which depends only upon the state vectors $v_k^a$ and $v_i^b$.

This formula has a large number of important consequences in quantum me- 
chanics and, as already said, it can be argued for in different ways. I will prove it 
from the following: 


ASSUMPTION 8. (i) The transition probabilities exist in the sense that the
probabilities above do not depend upon anything else.

(ii) The transition probability from $\lambda^a = \lambda_k$ in the first perfect experiment to
$\lambda^a = \lambda_k$ in the second perfect experiment is 1.

(iii) For all a, b, c we have that $\mu(\phi) = \lambda^a(\phi g_{bc})$ is a valid experimental
parameter.

(iv) For all a, b, c, i, k we have


$P(\lambda^b(\phi) = \lambda_i \mid \lambda^a(\phi) = \lambda_k) = P(\lambda^b(\phi g_{bc}) = \lambda_i \mid \lambda^a(\phi g_{bc}) = \lambda_k).$


REMARK. (1) Assumption 8 is an important instance where the symmetry
group setting is used in an essential way to derive a result that does not itself
involve the symmetry group G.

(2) Crucial assumptions will also be Assumption 3, that a common sample
space can be used in all experiments, and Assumption 5.

(3) We have $\lambda^b(\phi g_{bc}) = \lambda^c(\phi)$, so three experimental parameters are included
in Assumption 8.

(4) In the proof below we transform a single experiment by some element of G.
The use of the transformation g on t is then justified by:


LEMMA 3. Consider the homomorphism from the sample space
transformations to the parameter space transformations given by

$P^{\lambda g}(y \in B) = P^\lambda(y \in Bg^{-1}) = P^\lambda(yg \in B).$

When y = t is a complete sufficient statistic, this is an isomorphism, so that one
can let g be defined on the parameter space to begin with.

PROOF. Assume that there are group elements $g_1$ and $g_2$ of two different
sample group transformations such that

$P^{\lambda g}(t \in B) = P^\lambda(tg_1 \in B) = P^\lambda(tg_2 \in B).$

Then for all $\lambda$ and for all functions h we have

$E^\lambda(h(tg_1)) = E^\lambda(h(tg_2)).$

By the definition of a complete sufficient statistic it then follows that $tg_1 = tg_2$. □


Born's formula is given by: 




THEOREM 5. Under the assumptions above and the assumptions of Theorem 4
the transition formula is as follows:


(19) $P(\lambda^b = \lambda_i \mid \lambda^a = \lambda_k) = |v_k^{a\dagger} v_i^b|^2.$


The proof will depend upon a recent variant [17, 18] of a well-known mathe- 
matical result given by Gleason [30]. One advantage of this recent variant is that it 
also is valid for dimension 2, when the ordinary Gleason theorem fails. 


THE BUSCH-GLEASON THEOREM. Consider any Hilbert space H. Define
the set of effects as the set of operators on this Hilbert space with eigenvalues in
the interval [0, 1]. Assume that there is a generalized probability measure $\pi$ on
these effects, that is, a set function satisfying


$\pi(E) \ge 0$ for all E,
$\pi(I) = 1$,
$\sum_i \pi(E_i) = \pi(E)$ for effects $E_i$ with sum E.


Then $\pi$ is necessarily of the form $\pi(E) = \mathrm{tr}(\rho E)$ for some positive, self-adjoint,
trace 1 operator $\rho$.


The effects involved in the Busch-Gleason theorem turn out to have a rather
straightforward statistical interpretation. Look at an experiment b, corresponding
to a parameter $\lambda^b$ which can take the values $\lambda_i$. Let the result of this experiment be
given by a discrete complete sufficient statistic t, thus allowing for an experimental
error. Let t have a likelihood


$p_i(t) = P(t \mid \lambda^b = \lambda_i).$


The choice of experiment b, the set of possible parameter values $\{\lambda_i\}$ and the
result t again constitute a question-and-answer set, but now in a more advanced
form. The point is that the answer is uncertain, so that all these elements together
with the likelihood function must be included to specify the question-and-answer.


PROPOSITION 4. Exactly this information, the experiment $b$, the possible
answers and the statistic $t$ can be recovered from the effect defined by


(20)    $E = \sum_i p_i(t)\, v_i^b v_i^{b\dagger}.$


On the other hand, for fixed $t$ every effect $E$ can be written in the spectral
form (20).




PROOF. This is a spectral decomposition from which the eigenvalues $p_i(t)$
and the eigenvectors $v_i^b$ can be recovered. As discussed before, the eigenvectors
correspond to the question-and-answers for the case without measurement errors,
and from the likelihood the minimal sufficient observator $t$ can be recovered. The
last part is obvious. □


All this was discussed from a slightly different perspective in [35] for the case 
of a two-dimensional Hilbert space. 

Consider now the situation where a quantum system is known to be in a state
given by $v_k^a$, that is, a perfect experiment $a$ has been performed with result $\lambda^a = \lambda_k^a$.
Then make a new experiment $b$, but let this experiment be nonperfect. We require
the probability $\pi(E)$ that the result of the latter experiment shall be $t$, corresponding
to the effect $E$ given by (20). For this situation it is natural to define


(21)    $\pi(E) = \sum_i p_i(t)\, P(\lambda^b = \lambda_i^b \mid \lambda^a = \lambda_k^a).$

An important point in our development is that under Assumption 8, this $\pi$, when
ranging over all the effects $E$, will be a generalized probability. The crucial result
is the following:


PROPOSITION 5. Under Assumption 8, if $E_1$, $E_2$ and $E_1 + E_2$ all are effects,
then


$\pi(E_1 + E_2) = \pi(E_1) + \pi(E_2).$


PROOF. Let $E_1 = E$ be given by (20), and let


$E_2 = \sum_j q_j(t)\, v_j^c v_j^{c\dagger}$


for another experiment $c$ with another likelihood $q_j$.

First we remark that the relations $\pi(rE_1) = r\pi(E_1)$ and $\pi(E_1 + E_2) = \pi(E_1) +
\pi(E_2)$ are trivial when $E_1$, $E_2$, $rE_1$ and $E_1 + E_2$ are all effects and all $v_i^c = v_i^b$.

We now turn to the general case. The statistic t may then be assumed to be 
sufficient and complete with respect to both likelihoods. By Assumption 5 the pa- 
rameters of the two experiments are connected by a group transformation. Then 
by imitating the argument in the proof of Lemma 3, a complete sufficient statis- 
tic for experiment b can be transformed by an isomorphic group transformation 
to a complete sufficient statistic for experiment c; hence the complete sufficient 
statistics for the two experiments may be assumed identical. 

Consider the experiment $E_3$ defined by selecting experiment $E_1$ with probability
1/2 and experiment $E_2$ with probability 1/2. Since the same measurement
apparatus was used in both experiments, one can arrange things in such a way
that the person reading $t$ for experiment $E_3$ does not know which of the experiments
$E_1$ or $E_2$ was chosen. This arrangement is necessary in order to avoid the
result that the conditionality principle should disturb our argument for this situation;
see [3] and the response to these comments. We can regard $E_3$ as a genuinely
new experiment here.

Now use Assumption 5. From this assumption there exists a group element $g_{bc}$
such that $\lambda^c(\phi) = \lambda^b(\phi g_{bc})$. We can, and will, rotate experiment $b$ in such a way
that all final state vectors coincide with those of experiment $c$. Then from Assumption 8,
the transition probability to experiment $E_1$ is the same as if a rotated
initial state was chosen and the state vectors $v_i^c$ were chosen, but with a different
likelihood $p_i'(t) = p_i(t g_{bc})$.

From this perspective, the experiment $E_3$ can also be related to the same state
vectors, but with a likelihood


(22)    $r_i(t) = \frac{1}{2}\bigl(p_i'(t) + q_i(t)\bigr).$

The statistic $t$ will be sufficient relative to this likelihood, but may not be complete
or minimal. However, this is not needed for our argument.
This gives

(23)    $\pi(E_3) = \frac{1}{2}\pi(E_1) + \frac{1}{2}\pi(E_2)$


for experiments transformed to have the same final states.

We can now transform back so that all three experiments have the same initial 
state. Since experiment E3 in the rotated form had the same question-and-answer 
form as the other two experiments, only with a different likelihood (22), this ex- 
periment must also correspond to some effect. Then from (23), Assumption 8 and 
the fact that the same sample space is used for all three experiments both in the 
original and in the rotated version, the transition probability must satisfy 


(24)    $\pi(E_3) = \pi\bigl(\frac{1}{2}(E_1 + E_2)\bigr) = \frac{1}{2}\pi(E_1) + \frac{1}{2}\pi(E_2).$


The first equality here obviously holds in the rotated case; then it also holds when
we rotate back. If $E_1 + E_2$ is an effect, the factor 1/2 can be removed throughout
by suitably redefining the likelihood. □


PROPOSITION 6. For fixed initial state $\lambda^a = \lambda_k^a$, the set function defined
by (21) from the transition probability will under Assumption 8 be a generalized
probability on the final effects.


PROOF. The additivity property for a finite number of effects follows by in- 
duction from Proposition 5. The argument of Proposition 5 can also be used with 
a countable set of effects, so the additivity property for generalized effects follows 
for these set functions. 

It is obvious that $\pi(E) \ge 0$. The limiting effect $I$ corresponds to an experiment
and experimental result with likelihood 1 on each single parameter value, and it is
clear that the transition probability to this effect must be 1 from every initial state. □




PROOF OF THEOREM 5. Fix $a$ and $k$ and hence the state $v_k^a$, interpreted as
$\lambda^a = \lambda_k^a$. Define $\pi_{a,k}(v) = \pi_{a,k}(E)$ to be equal to the transition probability from
$v_k^a$ to the effect $E = vv^\dagger$ for an arbitrary state vector $v$, assumed to exist in
Assumption 8. Generalize to any $E$ by (21). By Proposition 6 the conditions of the
Busch-Gleason theorem are satisfied.

By this theorem, for any $v \in H$, we have $\pi_{a,k}(vv^\dagger) = v^\dagger \rho v$ for some $\rho$, which
is positive, self-adjoint and has trace 1. This implies $\rho = \sum_j c_j u_j u_j^\dagger$ for some
orthogonal set of vectors $\{u_j\}$. Self-adjointness implies that each $c_j$ is real-valued,
and positivity demands $c_j \ge 0$ for each $j$. The trace 1 condition implies $\sum_j c_j = 1$.

Inserting this gives $\pi_{a,k}(vv^\dagger) = \sum_j c_j |v^\dagger u_j|^2$. Specialize now to the particular
case given by $v = v_k^a$ for some $k$. For this case one must have $\sum_j c_j |v_k^{a\dagger} u_j|^2 = 1$,
and thus


$\sum_j c_j \bigl(1 - |v_k^{a\dagger} u_j|^2\bigr) = 0.$


This implies for each $j$ that either $c_j = 0$ or $|v_k^{a\dagger} u_j| = 1$. Since the last condition
implies $u_j = v_k^a$ (modulo an irrelevant phase factor), and this is a condition which
can only be true for one $j$, it follows that $c_j = 0$ for all $j$ other than the one leading
to $u_j = v_k^a$, and $c_j = 1$ for this particular $j$. Summarizing all this, we get $\rho = v_k^a v_k^{a\dagger}$
and Theorem 5 follows. □


A new challenge is of course to investigate to what extent this result, in fact all 
the results here from Section 11 onward, generalize to the case of parameters tak- 
ing more than a countable set of values. This will possibly require more advanced 
mathematical tools, but in that case it also seems quite certain that one can draw 
on known advanced results from quantum probability. 

The results above are valid and have relevance also outside quantum theory. In
Section 12.5 of [35] a large-scale example is sketched where, using Born's formula,
the prior probability of a second experiment is found, given the result of a
first experiment.

By the same proof, Born's formula can be generalized to $P(E \mid \lambda^a = \lambda_k^a) =
v_k^{a\dagger} E v_k^a$ for an arbitrary final effect $E$ [also Theorem 7(i) below]. This gives a
transition probability from any state vector $v_k^a \in H$.
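
As a small numerical illustration, the sketch below evaluates Born's formula in dimension 2 and checks that the quadratic form $v^\dagger E v$ for an effect of the spectral form (20) equals the likelihood-weighted sum of transition probabilities. The rotation angle, the likelihood values and every function name are assumptions made for the example only.

```python
import math

# Sketch of Born's formula and its extension to effects, in dimension 2.
# The angle, the likelihood values and all function names are illustrative assumptions.

def inner(u, v):
    """Hermitian inner product u^dagger v."""
    return sum(complex(x).conjugate() * y for x, y in zip(u, v))

def born(v_initial, v_final):
    """Transition probability |v_initial^dagger v_final|^2."""
    return abs(inner(v_initial, v_final)) ** 2

def effect(likelihood, basis):
    """E = sum_i p_i(t) v_i v_i^dagger for a fixed observed t."""
    n = len(basis[0])
    return [[sum(p * complex(v[r]) * complex(v[c]).conjugate()
                 for p, v in zip(likelihood, basis))
             for c in range(n)] for r in range(n)]

def expectation(v, m):
    """Quadratic form v^dagger M v."""
    n = len(v)
    return sum(complex(v[r]).conjugate() * m[r][c] * v[c]
               for r in range(n) for c in range(n)).real

angle = math.pi / 3
v_a = [1.0, 0.0]                                           # initial state v_k^a
basis_b = [[math.cos(angle / 2), math.sin(angle / 2)],     # first final state of b
           [-math.sin(angle / 2), math.cos(angle / 2)]]    # second final state of b
p_t = [0.9, 0.2]                                           # likelihood p_i(t) at the observed t

transition = [born(v_a, v) for v in basis_b]               # Born probabilities
via_effect = expectation(v_a, effect(p_t, basis_b))        # v^dagger E v
via_mixture = sum(p * q for p, q in zip(p_t, transition))  # sum_i p_i(t) P(i | k)
print(transition, via_effect, via_mixture)
```

The two routes agree because $E$ is diagonal in the basis of experiment $b$, so the quadratic form splits into one Born term per final state.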

Recall that H was originally defined using perfect experiments. Using Born's 
formula, it can be seen that a large class of experiments take the same Hilbert 
space as a point of departure. 


13. Basic formulae of quantum mechanics and of quantum statistics. Our
state concept may now be summarized as follows: To the state $\lambda^a(\cdot) = \lambda_k^a$ there
corresponds the state vector $v_k^a$, and these vectors determine the transition
probabilities as in (19). The probability distribution (19) also implies for perfect
experiments:




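THEOREM 6. (a) $E(\hat\lambda^b \mid \lambda^a = \lambda_k^a) = v_k^{a\dagger} T^b v_k^a$, where $T^b = \sum_j \lambda_j^b v_j^b v_j^{b\dagger}$.
(b) $E(f(\hat\lambda^b) \mid \lambda^a = \lambda_k^a) = v_k^{a\dagger} f(T^b) v_k^a$, where $f(T^b) = \sum_j f(\lambda_j^b) v_j^b v_j^{b\dagger}$.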


Thus, in ordinary quantum-mechanical terms, the expectation of every observ- 
able in any state is given by the familiar formula. 

It follows from Theorem 6(a) and from the preceding discussion that the first 
three rules of Isham ([39], page 71), taken there as a basis for quantum mechanics, 
are satisfied. The fourth rule, the Schrödinger equation, will be discussed in [36].

Now turn to nonperfect experiments. In ordinary statistics, a measurement is a
probability measure $P^{\theta}(dy)$ depending upon a parameter $\theta$. Assume now that such
a measurement depends upon the parameter $\lambda^b$, while the current state is given by
$\lambda^a = \lambda_k^a$. Then as in Theorem 6(b):


THEOREM 7. (a) Corresponding to the experiment $b \in \mathcal{A}$ one can define an
operator-valued measure $M$ by $M(dy) = \sum_j P^{\lambda_j^b}(dy)\, v_j^b v_j^{b\dagger}$. Then, given the initial
state $\lambda^a = \lambda_k^a$, the probability distribution of the result of experiment $b$ is given by
$P[dy \mid \lambda^a = \lambda_k^a] = v_k^{a\dagger} M(dy) v_k^a$.

(b) These operators satisfy $M(S) = I$ for the whole sample space $S$, and furthermore
$\sum_i M(A_i) = M(A)$ for any finite or countable sequence of disjoint elements
$\{A_1, A_2, \ldots\}$ with $A = \bigcup_i A_i$.


Theorem 7(b) is easily checked directly. 
A more general state assumption is a Bayesian one corresponding to this setting. 
From Theorem 7(a) we easily find: 


THEOREM 8. Let the current state be given by probabilities $\pi(\lambda_k^a)$ for different
values of $\lambda_k^a$. Then, defining $\rho = \sum_k \pi(\lambda_k^a)\, v_k^a v_k^{a\dagger}$, we get $P[dy] = \mathrm{tr}[\rho M(dy)]$.


A density operator p of such a kind is often used in quantum mechanics; the 
definition above gives a precise interpretation. In fact, these results are the basis 
for much of quantum theory, in particular for the quantum-statistical inference 
in [7]; for a formulation, see also [39]. 
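
Theorems 7 and 8 can be traced through on a toy case. The sketch below, written under illustrative assumptions (a two-point sample space, two real orthonormal bases, a made-up likelihood table and prior), builds the operator-valued measure $M$, forms the density operator $\rho$ and evaluates $P[y] = \mathrm{tr}[\rho M(y)]$. None of the numbers or names come from the paper.

```python
import math

# Sketch of Theorems 7 and 8 for a two-point sample space {0, 1}.
# The bases, the likelihood table and the prior weights are illustrative assumptions.

def projector(v):
    """Rank-one operator v v^dagger."""
    return [[complex(x) * complex(y).conjugate() for y in v] for x in v]

def combine(weights, mats):
    """Weighted sum of square matrices."""
    n = len(mats[0])
    return [[sum(w * m[r][c] for w, m in zip(weights, mats)) for c in range(n)]
            for r in range(n)]

def trace_product(a, b):
    """tr(a b), taken as real because a and b are self-adjoint here."""
    n = len(a)
    return sum(a[r][k] * b[k][r] for r in range(n) for k in range(n)).real

half = math.pi / 8
basis_a = [[1.0, 0.0], [0.0, 1.0]]                 # state vectors of experiment a
basis_b = [[math.cos(half), math.sin(half)],
           [-math.sin(half), math.cos(half)]]      # state vectors of experiment b
lik = [[0.8, 0.2],                                 # P(y | first value of lambda^b)
       [0.3, 0.7]]                                 # P(y | second value of lambda^b)
prior = [0.6, 0.4]                                 # probabilities of the a-states

proj_b = [projector(v) for v in basis_b]
M = [combine([lik[j][y] for j in range(2)], proj_b) for y in range(2)]
rho = combine(prior, [projector(v) for v in basis_a])

prob_y = [trace_product(rho, M[y]) for y in range(2)]   # Theorem 8
M_total = combine([1.0, 1.0], M)                        # Theorem 7(b): should be I
print(prob_y, M_total)
```

Because each row of the likelihood table sums to one and the projectors of experiment $b$ sum to the identity, the operators $M(y)$ add up to $I$ and the resulting probabilities add up to one.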

Note that the density matrix $v_k^a v_k^{a\dagger}$ is equivalent to the pure state $v_k^a$; similarly,
a density matrix $v_j^b v_j^{b\dagger}$ is equivalent to the statement that a perfect measurement
giving $\lambda^b = \lambda_j^b$ has been performed. By straightforward application of Born's
formula one gets:


THEOREM 9. Assume an initial state $v_k^a$, and assume that a perfect measurement
of $\lambda^b$ has been performed without knowing that value. Then this state is
described by a density matrix $\sum_j |v_k^{a\dagger} v_j^b|^2\, v_j^b v_j^{b\dagger}$.




This is related to the celebrated and much discussed projection postulate of
von Neumann. Writing $P_j = v_j^b v_j^{b\dagger}$ and $\rho = v_k^a v_k^{a\dagger}$ here, the $j$th term in the last
formula can be written $P_j \rho P_j$, which corresponds to a special case of the Dirac-
von Neumann formula [57].
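
The identity between the Born-weighted mixture and the sum of terms $P_j \rho P_j$ can be confirmed numerically. The following sketch does so in dimension 2; the rotation angle and all helper names are illustrative assumptions.

```python
import math

# Sketch: the Born-weighted mixture of projectors agrees with sum_j P_j rho P_j.
# The rotation angle and helper names are illustrative assumptions.

def projector(v):
    """Rank-one operator v v^dagger."""
    return [[complex(x) * complex(y).conjugate() for y in v] for x in v]

def matmul(a, b):
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(s, a):
    return [[s * x for x in row] for row in a]

def overlap_sq(u, v):
    """Squared modulus of the inner product u^dagger v."""
    return abs(sum(complex(x).conjugate() * y for x, y in zip(u, v))) ** 2

half = math.pi / 5
v_a = [1.0, 0.0]                                     # initial pure state
basis_b = [[math.cos(half), math.sin(half)],
           [-math.sin(half), math.cos(half)]]        # final states of experiment b

rho = projector(v_a)
mixture = [[0j, 0j], [0j, 0j]]
sandwich = [[0j, 0j], [0j, 0j]]
for v in basis_b:
    P = projector(v)
    mixture = add(mixture, scale(overlap_sq(v_a, v), P))   # Born weight times projector
    sandwich = add(sandwich, matmul(matmul(P, rho), P))    # P rho P

max_gap = max(abs(mixture[r][c] - sandwich[r][c]) for r in range(2) for c in range(2))
trace = (mixture[0][0] + mixture[1][1]).real
print(max_gap, trace)
```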

In general we have assumed for simplicity in this section that the state vectors
are nondegenerate eigenvectors of the corresponding operators, meaning that the
parameter $\lambda^a$ contains all relevant information about the system. This can be
generalized, however.


14. The electron revisited. The electron spin is in a way the simplest possible
quantum-mechanical system. The Hilbert space $H$ is two-dimensional. $H$ can
fruitfully be regarded as an irreducible representation space of the rotation group.
This group can be generated by the matrices $\sigma_x$, $\sigma_y$ and $\sigma_z$ given by (3).

In the standard quantum-mechanical formulation these three matrices are taken
as basic quantities, observables corresponding to the spin in the x-, y- and
z-directions, respectively. They all have eigenvalues $\pm 1$, corresponding to the
values of these spin observables. The corresponding eigenvectors are then taken as
state vectors for these (perfect) measurement results.

As a generalization, the observable $T^a = a_x\sigma_x + a_y\sigma_y + a_z\sigma_z$ for a real-valued
unit vector $a = (a_x, a_y, a_z)$ also has eigenvalues $\pm 1$, and the eigenvectors have a
similar state vector interpretation, corresponding to a spin vector in the direction $a$.

The transition probabilities between states defined by spin in different directions 
are found from the Born formula, from which (5) is derived. 

A more direct representation of the spin state of an electron was discussed
in [35]. In agreement with the alternative representation of quantum mechanics
proposed in the present paper, start with a spin vector $\phi$ and choose a direction $a$
in which the spin component shall be measured. As in Section 6 it is only possible
to measure $\lambda^a = \mathrm{sign}(\theta^a) = \mathrm{sign}(\phi \cdot a)$.

Define the 3-vector $u = \lambda^a a$. We claim that this vector gives a unique
representation of the spin state of the electron. As has now been stressed repeatedly,
we regard the state as a question-and-answer pair. The question (what is the spin
component in direction $a$?) is given by the chosen vector $a$; the answer is given
by $\lambda^a$. We can recover both these elements uniquely from the vector $u$, since a
spin component $-1$ in the direction $a$ is equivalent to a spin component $+1$ in the
direction $-a$.

For those knowing some quantum mechanics, the spin state can also be represented
by the Bloch sphere or Poincaré sphere matrix


$\rho = \frac{1}{2}(I + u \cdot \sigma),$


where $\sigma$ is a formal 3-vector with components given by the 2-by-2 matrices $\sigma_x$, $\sigma_y$
and $\sigma_z$ above. Obviously, specifying $\rho$ is equivalent to specifying $u$.
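
The equivalence between $\rho$ and $u$ can be made concrete with a short sketch: build the Bloch matrix from a unit 3-vector, read the vector back through $u_i = \mathrm{tr}(\rho\sigma_i)$, and check that $\rho$ is a rank-one projector. The final lines also evaluate $\mathrm{tr}(\rho_a \rho_b)$, which by the standard Pauli-matrix identities equals $\frac{1}{2}(1 + a \cdot b)$. The chosen directions and all function names are illustrative assumptions.

```python
import math

# Sketch: Bloch matrix rho = (I + u . sigma) / 2 for a unit 3-vector u.
# Function names and the chosen directions are illustrative assumptions.

SIGMA = {
    "x": [[0, 1], [1, 0]],
    "y": [[0, -1j], [1j, 0]],
    "z": [[1, 0], [0, -1]],
}

def bloch(u):
    """Return (I + u_x sigma_x + u_y sigma_y + u_z sigma_z) / 2."""
    ux, uy, uz = u
    return [[(1 + uz) / 2, (ux - 1j * uy) / 2],
            [(ux + 1j * uy) / 2, (1 - uz) / 2]]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def trace(a):
    return complex(a[0][0] + a[1][1]).real

def recover(rho):
    """Read u back from rho through u_i = tr(rho sigma_i)."""
    return tuple(trace(matmul(rho, SIGMA[axis])) for axis in ("x", "y", "z"))

a = (0.0, 0.0, 1.0)
b = (math.sin(1.0), 0.0, math.cos(1.0))
rho_a, rho_b = bloch(a), bloch(b)

u_back = recover(rho_b)                                  # should reproduce b
rho_b_squared = matmul(rho_b, rho_b)
idempotent_gap = max(abs(rho_b_squared[r][c] - rho_b[r][c])
                     for r in range(2) for c in range(2))
transition = trace(matmul(rho_a, rho_b))                 # tr(rho_a rho_b)
dot = sum(x * y for x, y in zip(a, b))
print(u_back, idempotent_gap, transition, (1 + dot) / 2)
```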

Finally, by conventional quantum mechanics we have $\rho = vv^\dagger$, where $v$ is the
ordinary complex two-dimensional Hilbert space state vector, only defined modulo
an arbitrary phase factor for an isolated system. Thus the spin state can be given in
any of four different ways:


(1) as a question $a$ together with an answer $\lambda^a$;
(2) by the 3-vector $u$;
(3) by the Bloch sphere matrix $\rho$;
(4) by the Hilbert space state vector $v$.


The discussion here can be generalized to other density matrices and further to 
the effects of Section 12; see [35]. 


15. Discussion. The treatment of quantum theory given in this paper is of
course still not complete. In [36] two further themes will be discussed from the
present point of view, namely the spacetime structure (including transformations
related to Planck's constant) and the Schrödinger equation, which gives the time
development of the state vector.

Our point of departure here is that both quantum theory and statistical theory 
deal with prediction, both using probability models of some kind. In our view, what 
we have arrived at seems to point at a general theory from which both traditional 
statistical theory and quantum theory emerge as special cases. 

A basic premise is that the states of quantum mechanics are related to the
parameter space of statistical models. This is an assumption that we have in common
with other authors, for instance, Caves, Fuchs and Schack [19]. Hidden variable 
models for quantum mechanics have been criticized in many contexts. In my view, 
a hidden (total) parameter model is a more flexible and useful concept. A hidden 
parameter does not in general have a value; in a given situation it can be looked 
upon more as part of the conceptual framework needed to describe the situation. 
Only by focusing on some given function of the hidden total parameter can we 
obtain a concrete parameter on which inference can be made from specific experi- 
ments. 

We allow the choice between several complementary experiments/questions on 
the same units. Furthermore, we impose symmetry conditions of the form often 
done in statistics, but more complicated because of the choice of experiment. Fi- 
nally, we allow model reduction using the orbit index of the experimental symme- 
try group. This leads to essential parts of quantum theory, and we find that the set 
of functions of complete sufficient statistics for the experiments essentially deter- 
mines the Hilbert space needed for the quantum formulation. 

Large parts of the present theory should in principle be valid on a macroscopic 
scale, too. This leads to the question of whether large-scale situations can be found 
which can be related in some way to this theory. Some brief examples of related 
applications can be mentioned. 

As an example of partly complementary parameters, look at different sets of
orthogonal contrasts in an analysis of variance situation. In randomized experiments
we have a symmetry group on the sample space leading to calculations [4] which
in fact have some formal resemblance to those of quantum theory.

With moderately complicated issues for a statistical investigation, it is always 
wise to elucidate the issue in question from several angles. This may involve per- 
forming experiments with different, but related parameters and making inference 
on different, but related parameters. A related case is conditioning on different 
ancillary statistics, where a connection to quantum theory was hinted at in [5]. 

In [33] it is shown that existing chemometric prediction methods can be related 
to rotational symmetry combined with a model reduction of the kind discussed in 
this paper. 

Thus the theory developed here may seem to have something to say to current 
applied statistics. These questions must wait for further developments, however. 

John von Neumann once said: “In mathematics you don’t understand things.
You just get used to them” (cited from [11]). By now, generations of physicists and
mathematicians have gotten used to the formal Hilbert space approach to quantum
theory. And important results have followed from this, both applied and theoreti- 
cal; some of the latter are mentioned in the Introduction. This gives overwhelming 
evidence that quantum theory is important and useful. But this in itself does not 
prove that the ordinary logical foundation for the theory is the simplest one. Our
claim is the following: Physics is basically an empirical science, and hence one
should seek a logical foundation related to the quantitative methodology used by
other empirical sciences, rather than one suggested by formal mathematics.
This has been some of the motivation behind the present work, and the results 
obtained seem to confirm that such a link is possible. 


APPENDIX 


A.1. Further properties of group actions. Adding a group to a statistical
model specification is often of interest, and does have consequences; see [42]. First
let a group $G$ act on a measurable sample space $S$. Measurability questions are
ignored here, as is common when discussing transformation groups; a full account
of this aspect is given in [56].

The orbits of a group $G$ acting on $S$ are the sets of the form $w_0 g$, where $w_0$ is
fixed and $g$ runs through $G$. The orbits of the parameter group induced from $G$
by (2) are defined similarly. Under conditions as given below, each set of orbits
can be given an index. The orbit index in the sample space will always have a
distribution which depends only upon the orbit index in the parameter space.

Concentrate now on the group $G$ acting on the total parameter space $\Phi$. Similar
concepts can be defined for the other group actions discussed above. The group $G$
is also assumed to have a topology.

We assume, as is commonly done, that the group operations $(g_1, g_2) \mapsto g_1 g_2$
and $g \mapsto g^{-1}$ are continuous. Furthermore, we will assume that the action
$(g, \phi) \mapsto \phi g$ is continuous for $\phi \in \Phi$. An additional condition, discussed in [61],
is that every inverse image of compact sets under the function $(g, \phi) \mapsto (\phi g, \phi)$
should be compact. A continuous action by a group $G$ on a space $\Phi$ satisfying this
condition is called proper. This technical condition turns out to have useful
properties and is assumed throughout this paper. When the group action is proper, the
orbits of the group can be proved to be closed sets relative to the topology of $\Phi$.

For fixed $\phi \in \Phi$, a stability subgroup $H$ of $G$ is defined as $\{h : \phi h = \phi\}$. These
are transformed within orbits of $G$ as $H \mapsto g^{-1} H g$.

Every locally compact group possesses a right-invariant Haar measure $\nu$
satisfying $\nu(Dg) = \nu(D)$ for $D \subset G$ [46]. This induces a right-invariant measure
on $\Phi$ itself if each stability group $H$ is compact, which is the case if the action of $G$
on $\Phi$ is proper and the group is locally compact. The last assertion is proved in
([61], Theorem 2.3.13(c)). A right-invariant measure $\nu$ on $\Phi$ satisfies by definition
$\nu(Fg) = \nu(F)$ for all (measurable) $F \subset \Phi$ and $g \in G$.


A.2. On group representation theory. A matrix representation of a group $G$
is defined as a function $U$ from the group to the set of (here complex) matrices
satisfying $U(gh) = U(g)U(h)$ for all $g, h \in G$. In other words, a representation
is a homomorphism from $G$ to the multiplicative group of square matrices of a
fixed dimension. Any representation $U$ and any fixed nonsingular matrix $K$ of the
same size can be used to construct another representation $S(g) = KU(g)K^{-1}$. If
the group is compact (and also in some other cases), we can always find such $S$
of minimal block diagonal form, and at the same time we can take $S$ to be unitary
[$S(g)^\dagger S(g) = I$]. If (and only if) the group is Abelian, each minimal block will be
one-dimensional.
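
A familiar instance may help here. The sketch below uses the rotations of the plane, a compact Abelian group, with $U(\theta)$ the usual real rotation matrix. It checks the homomorphism property and then applies one particular unitary change of basis $K$, chosen for illustration, after which $S(\theta) = KU(\theta)K^{-1}$ is diagonal with one-dimensional blocks $e^{i\theta}$ and $e^{-i\theta}$. The matrix $K$ and the function names are assumptions of the example.

```python
import cmath
import math

# Sketch: the rotation group of the plane, a compact Abelian group, as 2x2 matrices.
# K is one unitary change of basis chosen for illustration; names are assumptions.

def U(theta):
    """Real rotation matrix, a representation of planar rotations."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[c][r]).conjugate() for c in range(2)] for r in range(2)]

root = 1 / math.sqrt(2)
K = [[root, 1j * root], [root, -1j * root]]      # unitary, so K^{-1} = K^dagger

def S(theta):
    """Equivalent representation S(g) = K U(g) K^{-1}."""
    return matmul(matmul(K, U(theta)), dagger(K))

g, h = 0.7, 1.9
product = matmul(U(g), U(h))
direct = U(g + h)
homomorphism_gap = max(abs(product[r][c] - direct[r][c])
                       for r in range(2) for c in range(2))
S_g = S(g)
off_diagonal = max(abs(S_g[0][1]), abs(S_g[1][0]))
print(homomorphism_gap, off_diagonal, S_g[0][0], cmath.exp(1j * g))
```

Over the real numbers the rotation matrices cannot be split further, which is why the reduction to one-dimensional blocks needs complex matrices.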

An important aspect of this reduction appears if we look upon the matrices 
as operators on a vector space: Then each collection of blocks gives an invariant 
vector space under the multiplicative group of matrices, and each single minimal 
block gives an irreducible invariant vector space. For compact groups, the irre- 
ducible invariant vector spaces will be finite-dimensional. The minimal matrices 
in the blocks are called irreducible representations of the group. 

More generally, a class of operators $\{U(g); g \in G\}$ (where $G$ is a group)
on a, possibly infinite-dimensional, vector space is a representation if $U(gh) =
U(g)U(h)$ for all $g, h$. A representation of a compact group always has a complete
reduction in minimal matrix representations as described above. In particular, this
holds for the unitary regular representation defined on a Hilbert space $L^2(\Phi, \nu)$ by
$U_R(g) f(\phi) = f(\phi g)$. Here $\nu$ is the right-invariant measure for $G$ on $\Phi$.

A useful result is Schur’s lemma: 

If $U$ and $U'$ are irreducible representations, and $A$ is a bounded linear map such
that $U(g)A = AU'(g)$ for all $g$, then either $U$ and $U'$ are isomorphic or $A = 0$. If
$U(g)A = AU(g)$ for all $g$, then necessarily $A = \lambda I$ for some scalar $\lambda$.

More on group representations can be found in [9, 23, 31, 40, 47, 55, 62]. 




Acknowledgments. First of all I want to thank Richard Gill for numerous
discussions, for thoughtful comments and for exchange of ideas in general. Also,
advice and helpful remarks by Chris Isham, Peter Jupp, Peter McCullagh and Erling
Størmer are appreciated.

Much of the inspiration behind this paper was provided by three conferences 
arranged by Andrei Khrennikov in Växjö, Sweden. 


REFERENCES 


[1] ACCARDI, L. (1976). Nonrelativistic quantum mechanics as a noncommutative Markov
process. Advances in Math. 20 329-366. MR0484170

[2] ACCARDI, L. and HEYDE, C. C., eds. (1998). Probability Towards 2000. Lecture Notes in
Statist. 128. Springer, New York. MR1632689

[3] AITKIN, M. (1996). Comment on “Simple counterexamples against the conditionality
principle,” by I. S. Helland [Amer. Statist. 49 (1995) 351-356]. Amer. Statist. 50 384-385.
MR1368487

[4] BAILEY, R. A. (1991). Strata for randomized experiments (with discussion). J. Roy. Statist.
Soc. Ser. B 53 27-78. MR1094275

[5] BARNDORFF-NIELSEN, O. E. (1995). Diversity of evidence and Birnbaum's theorem (with
discussion). Scand. J. Statist. 22 513-522. MR1363227

[6] BARNDORFF-NIELSEN, O. E. and GILL, R. D. (2000). Fisher information in quantum
statistics. J. Phys. A 33 4481-4490. MR1768745

[7] BARNDORFF-NIELSEN, O. E., GILL, R. D. and JUPP, P. E. (2003). On quantum statistical
inference (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 65 775-816. MR2017871

[8] BARNUM, H., CAVES, C. M., FINKELSTEIN, J., FUCHS, C. A. and SCHACK, R. (2000).
Quantum probability from decision theory? Proc. Roy. Soc. Lond. Ser. A 456 1175-1182.
MR1809958

[9] BARUT, A. O. and RACZKA, R. (1985). Theory of Group Representations and Applications,
2nd ed. Polish Scientific Publishers, Warsaw. MR0495836

[10] BELAVKIN, V. P. (2000). Quantum probabilities and paradoxes of the quantum century. Infin.
Dimens. Anal. Quantum Probab. Relat. Top. 3 577-610. MR1805845

[11] BELAVKIN, V. P. (2002). Quantum causality, stochastics, trajectories and information.
Available at arxiv.org/abs/quant-ph/0208087.

[12] BELTRAMETTI, E. G. and CASSINELLI, G. (1981). The Logic of Quantum Mechanics.
Addison-Wesley, Reading, MA. MR0635780

[13] BOHM, D. (1952). A suggested interpretation of the quantum theory in terms of "hidden"
variables I. Phys. Rev. 85 166-179. MR0046287

[14] BOHR, A. and ULFBECK, O. (1995). Primary manifestation of symmetry. Origin of quantal
indeterminacy. Rev. Modern Phys. 67 1-35. MR1328825

[15] BOHR, N. (1935). Can quantum-mechanical description of physical reality be considered
complete? Phys. Rev. 48 696-702.

[16] BOHR, N. (1958). Atomic Physics and Human Knowledge. Wiley, New York.

[17] BUSCH, P. (2003). Quantum states and generalized observables: A simple proof of Gleason's
theorem. Phys. Rev. Lett. 91 120403. MR2037239

[18] CAVES, C. M., FUCHS, C. A., MANNE, K. K. and RENES, J. M. (2004). Gleason-type
derivations of the quantum probability rule for generalized measurements. Found. Phys.
34 193-209. Available at arxiv.org/abs/quant-ph/0306179. MR2054121

[19] CAVES, C. M., FUCHS, C. A. and SCHACK, R. (2002). Quantum probabilities as Bayesian
probabilities. Phys. Rev. A 65 022305. Available at arxiv.org/abs/quant-ph/0106133.




[20] CRAMER, J. C. (1986). The transactional interpretation of quantum mechanics. Rev. Modern
Phys. 58 647-687. MR0854444

[21] DAWID, A. P. (2000). Causal inference without counterfactuals (with discussion). J. Amer.
Statist. Assoc. 95 407-448. MR1803167

[22] DEUTSCH, D. (1999). Quantum theory of probability and decisions. Proc. Roy. Soc. Lond. Ser.
A 455 3129-3197. MR1807058

[23] DIACONIS, P. (1988). Group Representations in Probability and Statistics. IMS, Hayward, CA.
MR0964069

[24] DIRAC, P. A. M. (1947). The Principles of Quantum Mechanics, 3rd ed. Clarendon, Oxford.
MR0023198

[25] EVERETT, H. III (1957). "Relative state" formulation of quantum mechanics. Rev. Mod. Phys.
29 454-462. MR0094159

[26] FINKELSTEIN, J. (1999). Quantum probability from decision theory? Available at arxiv.org/
abs/quant-ph/9907004.

[27] FUCHS, C. A. (2002). Quantum mechanics as quantum information (and only a little more).
Available at arxiv.org/abs/quant-ph/0205039.

[28] GILL, R. D. (2005). On an argument of David Deutsch. In Quantum Probability and Infinite
Dimensional Analysis (M. Schürmann and U. Franz, eds.) 277-292. World Scientific,
Singapore.

[29] GILL, R. D. and ROBINS, J. M. (2001). Causal inference for complex longitudinal data. Ann.
Statist. 29 1785-1811. MR1891746

[30] GLEASON, A. (1957). Measures on the closed subspaces of a Hilbert space. J. Math. Mech. 6
885-893. MR0096113

[31] HAMERMESH, M. (1962). Group Theory and Its Application to Physical Problems. Addison-
Wesley, Reading, MA. MR0136667

[32] HARDY, L. (2001). Quantum theory from five reasonable axioms. Available at arxiv.org/abs/
quant-ph/0101012.

[33] HELLAND, I. S. (2002). Discussion of “What is a statistical model?,” by P. McCullagh. Ann.
Statist. 30 1286-1289. MR1936320

[34] HELLAND, I. S. (2004). Statistical inference under symmetry. Internat. Statist. Rev. 72
409-422.

[35] HELLAND, I. S. (2005). Quantum theory as a statistical theory under symmetry. In Foundations
of Probability and Physics 3 (A. Khrennikov, ed.) 127-149. Amer. Inst. Physics, Melville,
NY. Revised version available at arxiv.org/abs/quant-ph/0411174.

[36] HELLAND, I. S. (2006). Quantum mechanics from statistical theory under symmetry,
complementarity and time evolution. Unpublished manuscript.

[37] HELSTROM, C. W. (1976). Quantum Detection and Estimation Theory. Academic Press, New
York.

[38] HOLEVO, A. S. (1982). Probabilistic and Statistical Aspects of Quantum Theory. North-
Holland, Amsterdam. MR0681693

[39] ISHAM, C. J. (1995). Lectures on Quantum Theory. Imperial College Press, London.
MR1450867

[40] JAMES, G. and LIEBECK, M. (1993). Representation and Characters of Groups. Cambridge
Univ. Press. MR1237401

[41] LAURITZEN, S. (1996). Graphical Models. Oxford Univ. Press, New York. MR1419991

[42] LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation, 2nd ed. Springer,
New York. MR1639875

[43] MALLEY, J. D. and HORNSTEIN, J. (1993). Quantum statistical inference. Statist. Sci. 8
433-457. MR1250150

[44] MCCULLAGH, P. (2002). What is a statistical model? (with discussion). Ann. Statist. 30
1225-1310. MR1936320




[45] MEYER, P.-A. (1993). Quantum Probability for Probabilists. Springer, Berlin. MR1222649 

[46] NACHBIN, L. (1965). The Haar Integral. Van Nostrand, Princeton, NJ. MR0175995 

[47] NAIMARK, M. A. and STERN, A. I. (1982). Theory of Group Representations. Springer,
Berlin. MR0793377

[48] NEYMAN, J. (1923). On the application of probability theory to agricultural experiments. Essay
on principles. Section 9. Ann. Agricultural Sciences 10 1-51. [Translated in Statist. Sci. 5
(1990) 465-472]. MR1092986

[49] PARTHASARATHY, K. R. (1992). An Introduction to Quantum Stochastic Calculus. Birkhäuser,
Basel. MR1164866

[50] PEARL, J. (2000). Causality. Cambridge Univ. Press. MR1744773

[51] PETERSEN, A. (1985). The philosophy of Niels Bohr. In Niels Bohr. A Centenary Volume
(A. P. French and P. J. Kennedy, eds.) 299-310. Harvard Univ. Press.

[52] ROBINS, J. (1986). A new approach to causal inference in mortality studies with a sustained ex- 
posure period—Application to control of the healthy worker survivor effect. Math. Mod- 
elling 7 1393-1512. MR0877758 

[53] ROBINS, J. (1987). Addendum to “A new approach to causal inference in mortality studies
with a sustained exposure period—Application to control of the healthy worker survivor 
effect" Comput. Math. Appl. 14 923-945. MR0922792 

[54] RUBIN, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Ann.
Statist. 6 34-58. MR0472152

[55] SERRE, J.-P. (1977). Linear Representations of Finite Groups. Springer, Berlin. MR0450380 

[56] VARADARAJAN, V. S. (1985). Geometry of Quantum Theory, 2nd ed. Springer, New York. 
MR0805158 

[57] VOLOVICH, I. V. (2003). Seven principles of quantum mechanics. In Foundations of
Probability and Physics 2 569-575. Växjö Univ. Press. MR2039739. Available at arxiv.org/abs/
quant-ph/0212126.

[58] VON NEUMANN, J. (1932). Mathematische Grundlagen der Quantenmechanik. Springer,
Berlin.

[59] VON NEUMANN, J. (1955). Mathematical Foundations of Quantum Mechanics. Princeton 
Univ. Press. MR0066944 

[60] WIGHTMAN, A. S. (1976). Hilbert's sixth problem: Mathematical treatment of the axioms of
physics. In Mathematical Developments Arising from Hilbert Problems (F. E. Browder,
ed.) 147-240. Amer. Math. Soc., Providence, RI. MR0436800

[61] WIJSMAN, R. A. (1990). Invariant Measures on Groups and Their Use in Statistics. IMS,
Hayward, CA.

[62] WOLBARST, A. B. (1977). Symmetry and Quantum Systems. Van Nostrand Reinhold, New
York.


DEPARTMENT OF MATHEMATICS
UNIVERSITY OF OSLO
P.O. BOX 1053 BLINDERN
N-0316 OSLO
NORWAY
E-MAIL: ingeh@math.uio.no


The Annals of Statistics
2006, Vol. 34, No. 1, 78-91
DOI: 10.1214/009053606000000155
© Institute of Mathematical Statistics, 2006


IMPROVED MINIMAX PREDICTIVE DENSITIES UNDER 
KULLBACK-LEIBLER LOSS¹ 


BY EDWARD I. GEORGE, FENG LIANG AND XINYI XU 


University of Pennsylvania, Duke University 
and Ohio State University 


Let $X|\mu \sim N_p(\mu, v_x I)$ and $Y|\mu \sim N_p(\mu, v_y I)$ be independent 
p-dimensional multivariate normal vectors with common unknown mean $\mu$. Based 
on only observing $X = x$, we consider the problem of obtaining a predictive 
density $\hat p(y|x)$ for $Y$ that is close to $p(y|\mu)$ as measured by expected 
Kullback-Leibler loss. A natural procedure for this problem is the (formal) 
Bayes predictive density $\hat p_U(y|x)$ under the uniform prior $\pi_U(\mu) \equiv 1$, 
which is best invariant and minimax. We show that any Bayes predictive density 
will be minimax if it is obtained by a prior yielding a marginal that is 
superharmonic or whose square root is superharmonic. This yields wide classes 
of minimax procedures that dominate $\hat p_U(y|x)$, including Bayes predictive 
densities under superharmonic priors. Fundamental similarities and differences 
with the parallel theory of estimating a multivariate normal mean under 
quadratic loss are described. 


1. Introduction. Let $X|\mu \sim N_p(\mu, v_x I)$ and $Y|\mu \sim N_p(\mu, v_y I)$ be 
independent p-dimensional multivariate normal vectors with common unknown mean $\mu$, 
and let $p(x|\mu)$ and $p(y|\mu)$ denote the conditional densities of $X$ and $Y$. We 
assume that $v_x$ and $v_y$ are known. 

Based on only observing $X = x$, we consider the problem of obtaining a predictive 
density $\hat p(y|x)$ for $Y$ that is close to $p(y|\mu)$. We measure this closeness by 
Kullback-Leibler (KL) loss, 

\[ L(\mu, \hat p(\cdot|x)) = \int p(y|\mu) \log \frac{p(y|\mu)}{\hat p(y|x)} \, dy, \tag{1} \]

and evaluate $\hat p$ by its expected loss or risk function 

\[ R_{KL}(\mu, \hat p) = \int p(x|\mu) \, L(\mu, \hat p(\cdot|x)) \, dx. \tag{2} \]


Received July 2003; revised March 2005. 
¹Supported by NSF Grant DMS-01-30819. 
AMS 2000 subject classifications. Primary 62C20; secondary 62C10, 62F15. 
Key words and phrases. Bayes rules, heat equation, inadmissibility, multiple shrinkage, 
multivariate normal, prior distributions, shrinkage estimation, superharmonic marginals, 
superharmonic priors, unbiased estimate of risk. 




For the comparison of two procedures, we say that $\hat p_1$ dominates $\hat p_2$ if 
$R_{KL}(\mu, \hat p_1) \le R_{KL}(\mu, \hat p_2)$ for all $\mu$ and with strict inequality 
for some $\mu$. By a sufficiency and transformation reduction, this problem is seen to be 
equivalent to estimating the predictive density of $X_{n+1}$ under KL loss based on 
observing $X_1, \ldots, X_n$ when $X_1, \ldots, X_{n+1} | \mu$ i.i.d. $\sim N_p(\mu, \Sigma)$. 
For distributions beyond the normal, versions and approaches for the KL risk prediction 
problem have been developed by Aslan [2], Harris [10], Hartigan [11], Komaki [12, 14] and 
Sweeting, Datta and Ghosh [24]. 

For any prior distribution $\pi$ on $\mu$, Aitchison [1] showed that the average risk 
$r(\pi, \hat p) = \int R_{KL}(\mu, \hat p) \pi(\mu) \, d\mu$ is minimized by 

\[ \hat p_\pi(y|x) = \int p(y|\mu) \pi(\mu|x) \, d\mu, \tag{3} \]

which we will refer to as a Bayes predictive density. Unless $\pi$ is a trivial point 
prior, $\hat p_\pi(y|x) \notin \{p(y|\mu) : \mu \in R^p\}$, that is, $\hat p_\pi$ will not 
correspond to a “plug-in” estimate for $\mu$, although under suitable conditions on $\pi$, 
$\hat p_\pi(y|x) \to p(y|\mu)$ as $v_x \to 0$. 

For this problem, the best invariant predictive density (with respect to the location 
group) is the Bayes predictive density under the uniform prior $\pi_U(\mu) \equiv 1$, 
namely 

\[ \hat p_U(y|x) = \frac{1}{(2\pi(v_x + v_y))^{p/2}} \exp\left\{ -\frac{\|y - x\|^2}{2(v_x + v_y)} \right\}, \tag{4} \]

which has constant risk; see [18] and [19]. More precisely, one might refer to 
$\hat p_U$ as a formal Bayes procedure because $\pi_U$ is improper. Aitchison [1] showed 
that $\hat p_U(y|x)$ dominates the plug-in predictive density $p(y|\hat\mu_{MLE})$ which 
simply substitutes the maximum likelihood estimate $\hat\mu_{MLE} = x$ for $\mu$. As will 
be seen in Section 2, $\hat p_U$ is minimax for KL loss (1). That $\hat p_U$ is best 
invariant and minimax can also be seen as a special case of the more general recent 
results in Liang and Barron [17], who also show that $\hat p_U$ is admissible when 
$p = 1$ under the same loss. 
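As a small numerical companion to the comparison between the plug-in density and the uniform-prior predictive density, the sketch below evaluates the KL risk of any predictive density of the form $N_p(x, sI)$ in closed form. The function names and the closed-form expressions are our own derivation from (1) and (2), not taken from the paper; the plug-in rule corresponds to $s = v_y$ and the uniform-prior rule to $s = v_x + v_y$.

```python
import math

def kl_loss_gaussian(mu, v_y, m, s):
    """KL loss (1) between p(y|mu) = N_p(mu, v_y I) and a predictive N_p(m, s I).

    Closed form for isotropic normals (an assumption of this sketch).
    """
    p = len(mu)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu, m))
    ratio = v_y / s
    return 0.5 * p * (ratio - 1.0 - math.log(ratio)) + sq_dist / (2.0 * s)

def risk_centered_at_x(p, v_x, v_y, s):
    """KL risk (2) of the predictive density N_p(x, s I).

    Uses E||X - mu||^2 = p * v_x; the result does not depend on mu.
    """
    ratio = v_y / s
    return 0.5 * p * (ratio - 1.0 - math.log(ratio)) + p * v_x / (2.0 * s)

if __name__ == "__main__":
    p, v_x, v_y = 5, 1.0, 0.2
    print("plug-in risk:", risk_centered_at_x(p, v_x, v_y, v_y))
    print("uniform-prior risk:", risk_centered_at_x(p, v_x, v_y, v_x + v_y))
```

Under these formulas the uniform-prior risk reduces to $(p/2)\log(1 + v_x/v_y)$, which is smaller than the plug-in risk $p v_x/(2 v_y)$ because $\log(1+t) < t$ for $t > 0$.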

However, $\hat p_U$ is inadmissible when $p \ge 3$. Komaki [13] proved that when 
$p \ge 3$, $\hat p_U$ itself is dominated by the (formal) Bayes predictive density 

\[ \hat p_H(y|x) = \int p(y|\mu) \pi_H(\mu|x) \, d\mu, \tag{5} \]

where 

\[ \pi_H(\mu) = \|\mu\|^{-(p-2)} \tag{6} \]

is the (improper) harmonic prior recommended by Stein [21], which we subscript 
by “H” for harmonic. Although Komaki referred to $\pi_H$ as harmonic, his proof did 
not directly exploit this property. 




More recently, Liang [16] showed that $\hat p_U$ is also dominated by the proper Bayes 
predictive density $\hat p_a(y|x)$ under the prior $\pi_a(\mu)$ (see [23]) defined 
hierarchically as 

\[ \mu | s \sim N_p(0, s v_0 I), \qquad s \sim (1 + s)^{a-2}. \tag{7} \]

Here $v_0$ and $a$ are hyperparameters. The conditions for domination are that 
$v_0 \ge v_x$, and $a \in [0.5, 1)$ when $p = 5$ and $a \in [0, 1)$ when $p \ge 6$. Note 
that $\pi_a$ depends on the constant $v_0$ in (7), a dependence that will be maintained 
throughout this paper. The harmonic prior $\pi_H$ is well known to be the special case 
of $\pi_a$ when $a = 2$. 

These results closely parallel some key developments concerning minimax estimation 
of a multivariate normal mean under quadratic loss. Based on observing 
$X|\mu \sim N_p(\mu, I)$, that problem is to estimate $\mu$ under 

\[ R_Q(\mu, \hat\mu) = E_\mu \|\hat\mu - \mu\|^2, \tag{8} \]

where we have denoted quadratic risk by $R_Q$ to distinguish it from the KL risk 
$R_{KL}$ in (2). Under $R_Q$, $\hat\mu_{MLE} = X$ is best invariant and minimax, and is 
admissible if and only if $p \le 2$. Note that $\hat\mu_{MLE}$ plays the same role here 
that $\hat p_U$ plays in our KL risk problem. A further connection between 
$\hat\mu_{MLE}$ and $\hat p_U$ is revealed by the fact that 
$\hat\mu_{MLE} = E_{\pi_U}(\mu|x)$, the posterior mean of $\mu$ under $\pi_U(\mu) \equiv 1$. 

Stein [21] showed that $\hat\mu_H = E_{\pi_H}(\mu|x)$, the posterior mean under $\pi_H$, 
dominates $\hat\mu_{MLE}$ when $p \ge 3$, and Strawderman [23] showed that 
$\hat\mu_a = E_{\pi_a}(\mu|x)$, the proper Bayes rule under $\pi_a$ when $v_x = v_0 = 1$, 
dominates $\hat\mu_{MLE}$ when $a \in [0.5, 1)$ for $p = 5$ and when $a \in [0, 1)$ for 
$p \ge 6$. Comparing these results to those of Komaki and Liang in the predictive 
density problem, the parallels are striking. A principal purpose of our paper is to 
draw out these parallels in a more unified and transparent way. 

For these and other shrinkage domination results in the quadratic risk estimation 
problem, there exists a unifying theory that focuses on the properties of the 
marginal distribution of $X$ under $\pi$, namely 

\[ m_\pi(x) = \int p(x|\mu) \pi(\mu) \, d\mu. \tag{9} \]

The key to this theory is the representation due to Brown [4] that any posterior 
mean of $\mu$, $\hat\mu_\pi = E_\pi(\mu|x)$, is of the form 

\[ \hat\mu_\pi = x + \nabla \log m_\pi(x), \tag{10} \]

where $\nabla = (\partial/\partial x_1, \ldots, \partial/\partial x_p)'$. To show that 
$\hat\mu_H$ dominates $\hat\mu_{MLE}$, Stein [21, 22] used this representation to establish 
that $R_Q(\mu, \hat\mu_{MLE}) - R_Q(\mu, \hat\mu_\pi) = E_\mu U(X)$, where 

\[ U(X) = \|\nabla \log m_\pi(X)\|^2 - 2 \frac{\nabla^2 m_\pi(X)}{m_\pi(X)} \tag{11} \]

\[ \phantom{U(X)} = -4 \frac{\nabla^2 \sqrt{m_\pi(X)}}{\sqrt{m_\pi(X)}} \tag{12} \]

is an unbiased estimate of the risk reduction of $\hat\mu_\pi$ over $\hat\mu_{MLE}$, where 
$\nabla^2 m_\pi(x) = \sum_i \frac{\partial^2}{\partial x_i^2} m_\pi(x)$. 
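The equality of (11) and (12) is a short calculus step that the text leaves implicit. As a sketch, writing $m = m_\pi(X)$ and assuming $m$ is positive and twice differentiable:

```latex
\nabla^2 \sqrt{m}
  = \nabla \cdot \left( \frac{\nabla m}{2\sqrt{m}} \right)
  = \frac{\nabla^2 m}{2\sqrt{m}} - \frac{\|\nabla m\|^2}{4\, m^{3/2}},
\qquad\text{so}\qquad
-4\,\frac{\nabla^2 \sqrt{m}}{\sqrt{m}}
  = \frac{\|\nabla m\|^2}{m^2} - 2\,\frac{\nabla^2 m}{m}
  = \|\nabla \log m\|^2 - 2\,\frac{\nabla^2 m}{m}.
```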

Because $\hat\mu_{MLE}$ is minimax, it follows immediately from (11) that 
$\nabla^2 m_\pi(x) \le 0$ is a sufficient (though not necessary) condition for 
$\hat\mu_\pi$ to be minimax, and as long as $m_\pi(x)$ is not constant, for $\hat\mu_\pi$ 
to dominate $\hat\mu_{MLE}$. [Recall that a function $m(x)$ is superharmonic when 
$\nabla^2 m(x) \le 0$.] The fact that $\hat\mu_H$ dominates $\hat\mu_{MLE}$ when 
$p \ge 3$ now follows easily from the fact that nonconstant superharmonic priors [of 
which the harmonic prior $\pi_H(\mu)$ is of course a special case] yield superharmonic 
marginals $m_\pi$ for $X$. 

It follows from (12) that the weaker condition $\nabla^2 \sqrt{m_\pi(x)} \le 0$ is 
sufficient for $\hat\mu_\pi$ to be minimax, although strict inequality on a set of 
positive Lebesgue measure is then needed to guarantee domination over $\hat\mu_{MLE}$. 
Fourdrinier, Strawderman and Wells [6] showed that the Strawderman priors $\pi_a$ in (7) 
yield superharmonic $\sqrt{m_\pi}$, so that the minimaxity of the Strawderman estimators 
is established by (12). In fact, it follows from their results that $\pi_a$ also yields 
superharmonic $\sqrt{m_\pi}$ when $a \in [1, 2)$ and $p \ge 3$, thereby broadening the 
class of formal Bayes minimax estimators. 

One major aim of the present paper is to establish an analogous unifying theory 
for the KL risk prediction problem. Paralleling (10), we begin by showing how any 
Bayes predictive density $\hat p_\pi$ can be explicitly represented in terms of 
$\hat p_U$ and the form of the corresponding marginal $m_\pi$. Coupled with the heat 
equation, Brown's representation and Stein's identity, this representation is seen to 
lead to a new identity that links KL risk reduction to Stein's unbiased estimate of 
risk reduction. Based on this link, we obtain sufficient conditions on $m_\pi$ for 
minimaxity and domination of $\hat p_\pi$ over $\hat p_U$. These general conditions 
subsume the specialized results of Komaki [13] and Liang [16] and can be used to obtain 
wide classes of improved minimax Bayes predictive densities including $\hat p_H$ and 
$\hat p_a$. Furthermore, the underlying priors and marginals can be readily adapted to 
obtain minimax shrinkage toward an arbitrary point or subspace, and linear combinations 
of superharmonic priors and marginals can be constructed to obtain minimax multiple 
shrinkage predictive density analogues of the minimax multiple shrinkage estimators of 
George [7-9]. Thus, the parallels between the estimation and the prediction problem are 
broad, both qualitatively and technically. The main contribution of this paper is to 
establish this interesting connection. 


2. General conditions for minimaxity. In this section we develop and prove 
our main results concerning general conditions under which a Bayes predictive 
density $\hat p_\pi(y|x)$ in (3) will be minimax and dominate $\hat p_U(y|x)$. We begin 
with three lemmas that may also be of independent interest. The following general 
notation will be useful throughout. For $Z|\mu \sim N_p(\mu, vI)$ and a prior $\pi$ on 
$\mu$, we denote the marginal distribution of $Z$ by 

\[ m_\pi(z; v) = \int p(z|\mu) \pi(\mu) \, d\mu. \tag{13} \]

In terms of this notation, the marginal distributions of $X|\mu \sim N_p(\mu, v_x I)$ and 
$Y|\mu \sim N_p(\mu, v_y I)$ under $\pi$ are then $m_\pi(x; v_x)$ and $m_\pi(y; v_y)$, 
respectively. 


LEMMA 1. If $m_\pi(z; v_x)$ is finite for all $z$, then for every $x$, $\hat p_\pi(y|x)$ 
will be a proper probability distribution over $y$. Furthermore, the mean of 
$\hat p_\pi(y|x)$ is equal to $E_\pi(\mu|x)$. 


PROOF. Both claims follow by integrating (3) with respect to $y$ and switching 
the order of integration using the Fubini-Tonelli theorem. □ 


Lemma 1 is important because, for our decision problem to be meaningful, it 
is necessary for a predictive density to be a proper probability distribution. By the 
laws of probability, a Bayes predictive density $\hat p_\pi(y|x)$ will be a proper 
probability distribution whenever $\pi(\mu)$ is a proper prior distribution. But by 
Lemma 1, improper $\pi(\mu)$ can still yield proper $\hat p_\pi(y|x)$ under a very weak 
condition. 

Our next lemma establishes a key alternative representation of $\hat p_\pi(y|x)$ that 
makes use of the weighted mean 

\[ W = \frac{v_y X + v_x Y}{v_x + v_y}. \]

Note that $W$ would be a sufficient statistic for $\mu$ if both $X$ and $Y$ were 
observed. As $X$ and $Y$ are independent (conditionally on $\mu$), it follows that 
$W|\mu \sim N_p(\mu, v_w I)$ where 

\[ v_w = \frac{v_x v_y}{v_x + v_y}. \tag{14} \]

The marginal distribution of $W$ is then $m_\pi(w; v_w)$. 


LEMMA 2. For any prior $\pi(\mu)$, $\hat p_\pi(y|x)$ can be expressed as 

\[ \hat p_\pi(y|x) = \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)} \, \hat p_U(y|x), \tag{15} \]

where $\hat p_U(y|x)$ is defined by (4). Furthermore, the difference between the KL risks 
of $\hat p_U(y|x)$ and $\hat p_\pi(y|x)$ is given by 

\[ R_{KL}(\mu, \hat p_U) - R_{KL}(\mu, \hat p_\pi) = E_{\mu, v_w} \log m_\pi(W; v_w) - E_{\mu, v_x} \log m_\pi(X; v_x), \tag{16} \]

where $E_{\mu, v}(\cdot)$ stands for expectation with respect to the $N_p(\mu, vI)$ 
distribution. 


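PROOF. The joint marginal distribution of $X$ and $Y$ under $\pi$ is 

\begin{align*}
p_\pi(x, y) &= \int p(x|\mu) p(y|\mu) \pi(\mu) \, d\mu \\
&= \int \frac{1}{(2\pi v_x)^{p/2}} \exp\left\{ -\frac{\|x - \mu\|^2}{2 v_x} \right\}
   \frac{1}{(2\pi v_y)^{p/2}} \exp\left\{ -\frac{\|y - \mu\|^2}{2 v_y} \right\} \pi(\mu) \, d\mu \\
&= \frac{1}{(2\pi (v_x + v_y))^{p/2}} \exp\left\{ -\frac{\|y - x\|^2}{2(v_x + v_y)} \right\}
   \int \frac{1}{(2\pi v_w)^{p/2}} \exp\left\{ -\frac{\|w - \mu\|^2}{2 v_w} \right\} \pi(\mu) \, d\mu \\
&= \hat p_U(y|x) \, m_\pi(w; v_w).
\end{align*}

The representation (15) now follows since $\hat p_\pi(y|x) = p_\pi(x, y)/m_\pi(x; v_x)$. 
To prove (16), the KL risk difference can be expressed as 

\begin{align*}
R_{KL}(\mu, \hat p_U) - R_{KL}(\mu, \hat p_\pi)
&= \int\!\!\int p(x|\mu) p(y|\mu) \log \frac{\hat p_\pi(y|x)}{\hat p_U(y|x)} \, dx \, dy \\
&= \int\!\!\int p(x|\mu) p(y|\mu) \log \frac{m_\pi(w; v_w)}{m_\pi(x; v_x)} \, dx \, dy,
\end{align*}

where the second equality makes use of (15). The second expression in (16) is seen 
to equal this last expression by the change of variable theorem. □ 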
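Representation (15) can be checked numerically in a case where the Bayes predictive density is available in closed form. The sketch below is our own illustration, not part of the paper: it takes $p = 1$ and the conjugate prior $\mu \sim N(0, \tau^2)$ (an assumption chosen so that both sides are explicit), and compares the direct predictive density with the right-hand side of (15). The function names and the parameter `tau2` are hypothetical.

```python
import math

def normal_pdf(z, mean, var):
    """Density of N(mean, var) at z."""
    return math.exp(-(z - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def bayes_predictive_direct(y, x, v_x, v_y, tau2):
    """p_pi(y|x) for p = 1 and prior mu ~ N(0, tau2), via the conjugate posterior."""
    post_mean = tau2 * x / (tau2 + v_x)
    post_var = tau2 * v_x / (tau2 + v_x)
    return normal_pdf(y, post_mean, post_var + v_y)

def bayes_predictive_via_lemma2(y, x, v_x, v_y, tau2):
    """Right-hand side of (15): m(w; v_w) / m(x; v_x) * p_U(y|x)."""
    w = (v_y * x + v_x * y) / (v_x + v_y)
    v_w = v_x * v_y / (v_x + v_y)
    m_w = normal_pdf(w, 0.0, v_w + tau2)  # marginal of W under the prior
    m_x = normal_pdf(x, 0.0, v_x + tau2)  # marginal of X under the prior
    p_u = normal_pdf(y, x, v_x + v_y)     # uniform-prior predictive density (4)
    return m_w / m_x * p_u

if __name__ == "__main__":
    args = (0.3, 1.5, 1.0, 0.2, 2.0)  # y, x, v_x, v_y, tau2
    print(bayes_predictive_direct(*args), bayes_predictive_via_lemma2(*args))
```

The two functions should agree up to floating-point rounding for any admissible inputs, which is what Lemma 2 asserts for this particular prior.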


Paralleling Brown's representation (10), representation (15) reveals the explicit 
role played by the marginal distribution of the data under $\pi$. Analogous to Bayes 
estimators $E_\pi(\mu|x)$ of $\mu$ that "shrink" $\hat\mu_{MLE} = x$, this representation 
reveals that Bayes predictive densities $\hat p_\pi(y|x)$ "shrink" $\hat p_U(y|x)$ by a 
factor $m_\pi(w; v_w)/m_\pi(x; v_x)$. However, the nature of the shrinkage by 
$\hat p_\pi(y|x)$ is different than that by $E_\pi(\mu|x)$. To ensure that 
$\hat p_\pi(y|x)$ remains a proper probability distribution, the factor 
$m_\pi(w; v_w)/m_\pi(x; v_x)$ cannot be strictly less than 1. In contrast to simply 
shifting $\hat\mu_{MLE} = x$ toward the mean of $\pi$, $\hat p_\pi(y|x)$ adjusts 
$\hat p_U(y|x)$ to concentrate more on the higher probability regions of $\pi$. Figure 1 
illustrates such shrinkage of $\hat p_U(y|x)$ by $\hat p_H(y|x)$ in (5) when $v_x = 1$, 
$v_y = 0.2$ and $p = 5$. 

For our purposes, the principal benefit of (15) is that it reduces the KL risk 
difference (16) to a simple functional of the marginal $m_\pi(z; v)$. As will be seen in 
the proof of Theorem 1 below, (16) is the key to establishing general conditions for 
the dominance of $\hat p_\pi$ over $\hat p_U$. First, however, we use it to facilitate a 
simple direct proof of the minimaxity of $\hat p_U$, a result that also follows from the 
more general results of Liang and Barron [17]. 


[FIG. 1. Panels for x = (2, 0, 0, 0, 0), x = (3, 0, 0, 0, 0) and x = (4, 0, 0, 0, 0).] 


COROLLARY 1. The Bayes predictive density under $\pi_U(\mu) \equiv 1$, namely 
$\hat p_U$, is minimax under $R_{KL}$. 


PROOF. By a transformation of variables, $x \to (x - \mu)$ and $y \to (y - \mu)$, it 
is easy to see that $R_{KL}(\mu, \hat p_U) = R_{KL}(0, \hat p_U) =: r$ for all $\mu$, so 
that $R_{KL}(\mu, \hat p_U)$ is constant. Next, we show that $r$ is a Bayes risk limit of 
a sequence of Bayes rules $\hat p_{\pi_n}$ with $\pi_n(\mu) = N_p(0, \sigma_n^2 I)$, where 
$\sigma_n^2 \to \infty$ as $n \to \infty$. By the fact that $r(\pi_n, \hat p_U) = r$ and (16), 

\[ r - r(\pi_n, \hat p_{\pi_n}) = \int \pi_n(\mu) \left[ E_{\mu, v_w} \log m_{\pi_n}(W; v_w) - E_{\mu, v_x} \log m_{\pi_n}(X; v_x) \right] d\mu, \tag{17} \]

where 

\[ m_{\pi_n}(z; v) = (2\pi (v + \sigma_n^2))^{-p/2} \exp\left\{ -\frac{\|z\|^2}{2(v + \sigma_n^2)} \right\}. \]

It is now easy to check that (17) $= O(1/\sigma_n^2)$ and hence goes to zero as $n$ goes 
to infinity. By Theorem 5.18 of [3], the minimaxity of $\hat p_U$ follows. □ 
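The "easy to check" step in the proof of Corollary 1 can be made explicit. The following is our own worked computation under the normal priors $\pi_n(\mu) = N_p(0, \sigma_n^2 I)$ used in the proof, using $E_{\mu,v}\|Z\|^2 = \|\mu\|^2 + pv$ and $E_{\pi_n}\|\mu\|^2 = p\sigma_n^2$:

```latex
\int \pi_n(\mu)\, E_{\mu,v} \log m_{\pi_n}(Z; v)\, d\mu
  = -\frac{p}{2}\log\bigl(2\pi(v + \sigma_n^2)\bigr)
    - \frac{p\sigma_n^2 + pv}{2(v + \sigma_n^2)}
  = -\frac{p}{2}\log\bigl(2\pi(v + \sigma_n^2)\bigr) - \frac{p}{2},
\]
\[
\text{so that}\quad
r - r(\pi_n, \hat p_{\pi_n})
  = \frac{p}{2}\log\frac{v_x + \sigma_n^2}{v_w + \sigma_n^2}
  \le \frac{p\,(v_x - v_w)}{2\,(v_w + \sigma_n^2)}
  = O(1/\sigma_n^2).
```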


Our next lemma provides a new identity that links $E_{\mu, v} \log m_\pi(Z; v)$ to Stein's 
unbiased estimate of risk reduction $U(x)$ in (11) and (12) for the quadratic risk 
estimation problem. When combined with (16) in Theorem 1, this identity will be 
seen to play a key role in establishing sufficient conditions on $m_\pi$ for 
$\hat p_\pi$ to be minimax and to dominate $\hat p_U$. 


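LEMMA 3. If $m_\pi(z; v_x)$ is finite for all $z$, then for any $v_w \le v \le v_x$, 
$m_\pi(z; v)$ is finite. Moreover, 

\[ \frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) = E_{\mu, v}\left( \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \frac{1}{2} \|\nabla \log m_\pi(Z; v)\|^2 \right) \tag{18} \]

\[ \phantom{\frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v)} = E_{\mu, v}\left( 2\, \frac{\nabla^2 \sqrt{m_\pi(Z; v)}}{\sqrt{m_\pi(Z; v)}} \right). \tag{19} \]


PROOF. When $m_\pi(z; v_x)$ is finite for all $z$, it is easy to check that for any 
fixed $z$ and any $v_w \le v \le v_x$, 

\[ m_\pi(z; v) \le \left( \frac{v_x}{v} \right)^{p/2} m_\pi(z; v_x) < \infty. \]

Letting $Z^* = (Z - \mu)/\sqrt{v} \sim N_p(0, I)$, we obtain 

\[ \frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) = \frac{\partial}{\partial v} E \log m_\pi(\sqrt{v}\, Z^* + \mu; v) = E\, \frac{(\partial/\partial v)\, m_\pi(\sqrt{v}\, Z^* + \mu; v)}{m_\pi(\sqrt{v}\, Z^* + \mu; v)}, \tag{20} \]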


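where, writing $z = \sqrt{v}\, z^* + \mu$, 

\begin{align*}
\frac{\partial}{\partial v} m_\pi(\sqrt{v}\, z^* + \mu; v)
&= \frac{\partial}{\partial v} \int \frac{1}{(2\pi v)^{p/2}} \exp\left\{ -\frac{\|\sqrt{v}\, z^* + \mu - \mu'\|^2}{2v} \right\} \pi(\mu') \, d\mu' \\
&= \int \left( -\frac{p}{2v} + \frac{\|z - \mu'\|^2}{2v^2} - \frac{(z - \mu)'(z - \mu')}{2v^2} \right) p(z|\mu') \pi(\mu') \, d\mu' \\
&= \frac{\partial}{\partial v} m_\pi(z; v) - \int \frac{(z - \mu)'(z - \mu')}{2v^2} \, p(z|\mu') \pi(\mu') \, d\mu'.
\end{align*}

Using the fact that 

\[ \frac{\partial}{\partial v} m_\pi(z; v) = \frac{1}{2} \nabla^2 m_\pi(z; v), \tag{21} \]

which is straightforward to verify, and by Brown's representation 
$E_\pi(\mu'|z) = z + v \nabla \log m_\pi(z)$ from (10), 

\[ \frac{(\partial/\partial v)\, m_\pi(\sqrt{v}\, Z^* + \mu; v)}{m_\pi(\sqrt{v}\, Z^* + \mu; v)} = \frac{1}{2} \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} + \frac{(Z - \mu)' \nabla \log m_\pi(Z; v)}{2v}. \tag{22} \]

Finally, by (2.3) of [22], 

\[ E_{\mu, v}\, \frac{(Z - \mu)' \nabla \log m_\pi(Z; v)}{2v} = E_{\mu, v}\, \frac{1}{2} \nabla^2 \log m_\pi(Z; v) \tag{23} \]

\[ \phantom{E_{\mu, v}\, \frac{(Z - \mu)' \nabla \log m_\pi(Z; v)}{2v}} = E_{\mu, v}\, \frac{1}{2} \left( \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)} - \|\nabla \log m_\pi(Z; v)\|^2 \right). \tag{24} \]

Combining (20), (22) and (24) yields (18). That (18) equals (19) can be verified 
directly. □ 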


It may be of independent interest to note that the intermediate step (21) is in fact 
a restatement of the well-known fact that any Gaussian convolution will solve the 
homogeneous heat equation, which has a long history in science and engineering; 
for example, see [20]. Brown, DasGupta, Haff and Strawderman [5] recently used 
identities derived from the heat equation, including one bearing a formal similarity 
to (21), in other contexts of inference and decision theory. Furthermore, as the As- 
sociate Editor kindly pointed out to us, the proof of Lemma 3 can also be obtained 
by appealing to Theorem 1 and equation (54) of that paper. 
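The heat equation (21) can be illustrated numerically in a simple case. The sketch below is our own example, not from the paper: it takes $p = 1$ and the prior $\mu \sim N(0, \tau^2)$ (an assumption), for which the marginal $m(z; v)$ is the $N(0, v + \tau^2)$ density, and compares central finite-difference approximations of $\partial m/\partial v$ and $\frac{1}{2}\,\partial^2 m/\partial z^2$. The step size `h` and function names are hypothetical choices.

```python
import math

def marginal(z, v, tau2=2.0):
    """m(z; v) for p = 1 under mu ~ N(0, tau2): the N(0, v + tau2) density."""
    var = v + tau2
    return math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def d_dv(z, v, h=1e-3):
    """Central difference approximation of the derivative in v."""
    return (marginal(z, v + h) - marginal(z, v - h)) / (2.0 * h)

def d2_dz2(z, v, h=1e-3):
    """Central difference approximation of the second derivative in z."""
    return (marginal(z + h, v) - 2.0 * marginal(z, v) + marginal(z - h, v)) / (h * h)

if __name__ == "__main__":
    for z, v in ((0.0, 0.5), (1.2, 1.0), (-2.5, 0.3)):
        # The two columns should agree up to discretization error.
        print(z, v, d_dv(z, v), 0.5 * d2_dz2(z, v))
```

Agreement here is only up to the finite-difference error, so it illustrates (21) for one family of marginals rather than proving it.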




THEOREM 1. Suppose $m_\pi(z; v_x)$ is finite for all $z$. 


(i) If $\nabla^2 m_\pi(z; v) \le 0$ for all $v_w \le v \le v_x$, then $\hat p_\pi(y|x)$ is 
minimax under $R_{KL}$. Furthermore, $\hat p_\pi(y|x)$ dominates $\hat p_U(y|x)$ unless 
$\pi = \pi_U$. 

(ii) If $\nabla^2 \sqrt{m_\pi(z; v)} \le 0$ for all $v_w \le v \le v_x$, then 
$\hat p_\pi(y|x)$ is minimax under $R_{KL}$. Furthermore, $\hat p_\pi(y|x)$ dominates 
$\hat p_U(y|x)$ if for all $v_w \le v \le v_x$, $\nabla^2 \sqrt{m_\pi(z; v)} < 0$ on a set 
of positive Lebesgue measure. 


PROOF. As established in Corollary 1, $\hat p_U$ is minimax under $R_{KL}$. Thus, 
minimaxity is established by showing that (16) is nonnegative, and dominance is 
established by showing that (16) is strictly positive on a set of positive Lebesgue 
measure. Then (i) and (ii) follow from (18), (19) and the fact that $v_w < v_x$. □ 
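The step from Lemma 3 to the sign of (16) can be spelled out as follows. This is our own sketch of the argument, assuming enough regularity for the fundamental theorem of calculus to apply to $v \mapsto E_{\mu,v} \log m_\pi(Z; v)$ on $[v_w, v_x]$:

```latex
R_{KL}(\mu, \hat p_U) - R_{KL}(\mu, \hat p_\pi)
  = E_{\mu, v_w} \log m_\pi(W; v_w) - E_{\mu, v_x} \log m_\pi(X; v_x)
  = -\int_{v_w}^{v_x} \frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v)\, dv
\]
\[
  = -\int_{v_w}^{v_x} E_{\mu, v}\!\left(
      \frac{\nabla^2 m_\pi(Z; v)}{m_\pi(Z; v)}
      - \frac{1}{2}\,\|\nabla \log m_\pi(Z; v)\|^2 \right) dv
  = -\int_{v_w}^{v_x} E_{\mu, v}\!\left(
      2\,\frac{\nabla^2 \sqrt{m_\pi(Z; v)}}{\sqrt{m_\pi(Z; v)}} \right) dv .
```

Under condition (i) the integrand in the first form is nonpositive, and under condition (ii) the integrand in the second form is nonpositive, so in either case the risk difference is nonnegative.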


COROLLARY 2. If $m_\pi(z; v_x)$ is finite for all $z$, then $\hat p_\pi(y|x)$ will be 
minimax if the prior density $\pi$ satisfies $\nabla^2 \pi(\mu) \le 0$ a.e. Furthermore, 
$\hat p_\pi(y|x)$ will dominate $\hat p_U(y|x)$ unless $\pi = \pi_U$. 


PROOF. It is straightforward to show (see problem 1.7.16 of [15]) that 
$\nabla^2 m_\pi(z; v) \le 0$ when $\nabla^2 \pi(\mu) \le 0$ a.e. Therefore, Corollary 2 
follows immediately from (i) of Theorem 1. □ 


The above sufficient conditions for minimaxity and domination in the KL risk 
prediction problem are essentially the same as those for minimaxity and domination 
in the quadratic risk estimation problem. What drives this connection is revealed by 
comparing Stein's unbiased estimate of quadratic risk reduction in (11) and (12) with 
(18) and (19). It follows directly from this comparison that the risk reduction in 
the quadratic risk estimation problem can be expressed in terms of $\log m_\pi$ as 

\[ R_Q(\mu, \hat\mu_{MLE}) - R_Q(\mu, \hat\mu_\pi) = -2 \left[ \frac{\partial}{\partial v} E_{\mu, v} \log m_\pi(Z; v) \right]_{v=1}. \tag{25} \]

3. Examples. In this section we show how Theorem 1 and Corollary 2 can be 
applied to establish the minimaxity of $\hat p_H$ and $\hat p_a$. Compared to the 
minimaxity proofs of Komaki [13] for $\hat p_H$, and of Liang [16] for $\hat p_a$, this 
unified approach is more direct and more general. We further indicate how our approach 
can be used to obtain wide classes of new minimax prediction densities. 


EXAMPLE 1. Let us return to the Bayes predictive density $\hat p_H$, the special case 
of (3) under the harmonic prior $\pi_H(\mu)$ in (6). Following Komaki [13], the marginal 
of $Z|\mu \sim N_p(\mu, vI)$ under $\pi_H$ can be expressed as 

\[ m_H(z; v) \propto v^{-(p-2)/2} \, \phi_p(\|z/\sqrt{v}\|), \tag{26} \]

where $\phi_p(u) = u^{-p+2} \int_0^{u^2/2} t^{p/2-2} \exp(-t) \, dt$ is the incomplete 
Gamma function. By Lemma 2, $\hat p_H$ can be expressed in terms of this marginal as 

\[ \hat p_H(y|x) = \frac{m_H(w; v_w)}{m_H(x; v_x)} \, \hat p_U(y|x). \tag{27} \]

Because $\pi_H$ is harmonic [$\nabla^2 \pi_H(\mu) = 0$ a.e.], and hence superharmonic, 
for $p \ge 3$, the fact that $\hat p_H$ is minimax and dominates $\hat p_U$ follows 
immediately from Corollary 2. 

Beyond $\hat p_H$, one might consider the class of Bayes predictive densities 
$\hat p_\pi$ corresponding to the (improper) multivariate t priors 
$\pi(\mu) = (\|\mu\|^2 + 2/a_2)^{-(a_1 + p/2)}$. Because these priors are superharmonic 
for $a_1 \le -1$ and $p \ge 3$, the minimaxity and domination of $\hat p_U$ by these rules 
follows immediately from Corollary 2. 
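To make (26) and (27) concrete, the sketch below evaluates the harmonic-prior marginal through a power series for the lower incomplete gamma function and then forms the shrinkage factor $m_H(w; v_w)/m_H(x; v_x)$. It is an illustrative implementation of ours, not the authors' code: the function names are hypothetical, the marginal is computed only up to the constant that cancels in (27), and the simple series is intended for moderate arguments (roughly $\|z\|^2/(2v)$ below a few hundred).

```python
import math

def lower_gamma(s, x, max_terms=10000):
    """Lower incomplete gamma: integral of t^(s-1) e^(-t) over [0, x], by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / s
    total = term
    for k in range(1, max_terms):
        term *= x / (s + k)
        total += term
        if term < 1e-17 * total:
            break
    return x ** s * math.exp(-x) * total

def m_harmonic(z, v):
    """Unnormalized marginal (26): v^{-(p-2)/2} * phi_p(||z|| / sqrt(v))."""
    p = len(z)
    s = p / 2.0 - 1.0
    r2 = sum(zi * zi for zi in z)
    if r2 == 0.0:
        return (2.0 * v) ** (-s) / s  # limiting value as ||z|| -> 0
    return r2 ** (-s) * lower_gamma(s, r2 / (2.0 * v))

def p_uniform(y, x, v_x, v_y):
    """Uniform-prior predictive density (4)."""
    p = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(y, x))
    var = v_x + v_y
    return math.exp(-d2 / (2.0 * var)) / (2.0 * math.pi * var) ** (p / 2.0)

def shrink_factor(y, x, v_x, v_y):
    """The factor m_H(w; v_w) / m_H(x; v_x) appearing in (27)."""
    v_w = v_x * v_y / (v_x + v_y)
    w = [(v_y * a + v_x * b) / (v_x + v_y) for a, b in zip(x, y)]
    return m_harmonic(w, v_w) / m_harmonic(x, v_x)

def p_harmonic(y, x, v_x, v_y):
    """Harmonic-prior predictive density via representation (27)."""
    return shrink_factor(y, x, v_x, v_y) * p_uniform(y, x, v_x, v_y)

if __name__ == "__main__":
    x = [2.0, 0.0, 0.0, 0.0, 0.0]  # setting of Figure 1: v_x = 1, v_y = 0.2, p = 5
    for y1 in (0.0, 1.0, 2.0, 3.0, 4.0):
        y = [y1, 0.0, 0.0, 0.0, 0.0]
        print(y1, shrink_factor(y, x, 1.0, 0.2), p_harmonic(y, x, 1.0, 0.2))
```

In this setting the factor is larger for $y$ near the origin than for $y$ far from it, which is the sense in which $\hat p_H$ moves mass of $\hat p_U$ toward the region favored by the prior.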


EXAMPLE 2. Turning next to $\hat p_a$, the marginal of $Z|\mu \sim N_p(\mu, vI)$ under 
the Strawderman prior $\pi_a$ in (7) can be expressed as 

\[ m_a(z; v) \propto \int_0^\infty v^{-p/2} \left( (v_0/v) s + 1 \right)^{-p/2} \exp\left\{ -\frac{\|z/\sqrt{v}\|^2}{2((v_0/v) s + 1)} \right\} (1 + s)^{a-2} \, ds. \tag{28} \]

Because $\pi_H$ is the special case of $\pi_a$ when $a = 2$, it follows that $m_H(z; v)$ 
is the special case of $m_a(z; v)$ when $a = 2$. As Fourdrinier, Strawderman and 
Wells [6] showed, the marginal for any proper prior cannot be superharmonic, 
so that Theorem 1(i) cannot hold for $\hat p_a$ when $a < 1$. However, Theorem 1(ii) 
does hold for such $\hat p_a$, because $\sqrt{m_a(z; v)}$ is superharmonic for 
$v \le v_0$ when $p = 5$ and $a \in [0.5, 1)$ or $p \ge 6$ and $a \in [0, 1)$. This fact 
can be obtained using $h(s) \propto (1 + s)^{a-2}$ in Theorem 2 below, which extends 
Theorem 1 of [6]. 

THEOREM 2. For a nonnegative function $h(s)$ over $[0, \infty)$, consider the scale 
mixture prior 

\[ \pi_h(\mu) = \int \pi(\mu | s v_0) \, h(s) \, ds, \tag{29} \]

where $\pi(\mu | s v_0) = N_p(0, s v_0 I)$. For $Z|\mu \sim N_p(\mu, vI)$, let 

\[ m_h(z; v) \propto \int_0^\infty \{ 2\pi v (s/r + 1) \}^{-p/2} \exp\left\{ -\frac{\|z\|^2}{2v(s/r + 1)} \right\} h(s) \, ds \tag{30} \]

be the marginal distribution of $Z$ under $\pi_h(\mu)$, where $r = v/v_0$. Let $h$ be a 
positive function such that: 

(i) $-(s + 1) h'(s)/h(s)$ can be decomposed as $l_1(s) + l_2(s)$, where $l_1 \le A$ is 
nondecreasing while $0 \le l_2 \le B$ with $\frac{1}{2} A + B \le (p - 2)/4$, 

(ii) $\lim_{s \to \infty} h(s)/(s + 1)^{p/2} = 0$. 

Then $\sqrt{m_h(z; v)}$ in (30) is superharmonic for all $v \le v_0$, and when 
$v_x \le v_0$, the Bayes predictive density $\hat p_h(y|x)$ under $\pi_h(\mu)$ in (29) is 
minimax. 


PROOF. The proof of Theorem 1 in [6] shows that $\sqrt{m_h(z; v_0)}$ in (30) is 
superharmonic when $v_0 = 1$, and it is straightforward to show that this is true for 
general $v_0$. From this fact, $\sqrt{m_h(z; v)}$ will be superharmonic for all 
$v \le v_0$ if $h_r(s) := r h(rs)$ satisfies (i) and (ii) when $r \in (0, 1]$. 

First we show that $h_r$ satisfies (i). By the assumptions on $h$, we have 
$-(s + 1) h'(s)/h(s)$ decomposed as $l_1(s) + l_2(s)$. Then 

\[ -(s + 1) \frac{h_r'(s)}{h_r(s)} = -\frac{r(s + 1)}{rs + 1} \, (rs + 1) \frac{h'(rs)}{h(rs)} = \frac{rs + r}{rs + 1} \left( l_1(rs) + l_2(rs) \right). \]

Choose $\tilde l_i(s)$ to be $l_i(rs)$ multiplied by $r(s + 1)/(rs + 1)$, $i = 1, 2$. They 
can be checked to satisfy the conditions since the factor $(rs + r)/(rs + 1)$ is a 
nondecreasing function of $s$ and less than or equal to 1 when $0 < r \le 1$. To see that 
$h_r$ satisfies (ii), note that 

\[ \frac{h_r(s)}{(s + 1)^{p/2}} = \frac{h(rs)}{(rs + 1)^{p/2}} \cdot r \left( \frac{rs + 1}{s + 1} \right)^{p/2} \]

goes to zero when $s \to \infty$ since the first term goes to zero by the assumption on 
$h$. Thus $\sqrt{m_h(z; v)}$ will be superharmonic for all $v \le v_0$. When 
$v_x \le v_0$, the minimaxity of $\hat p_h(y|x)$ then follows from (ii) of Theorem 1. □ 
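The claim in Example 2 that the Strawderman mixing density satisfies the conditions of Theorem 2 can be checked directly. The following is our own worked verification for $h(s) = (1 + s)^{a-2}$; it takes $l_2 \equiv 0$ with $B = 0$, which assumes the lower bound on $l_2$ in condition (i) is non-strict:

```latex
-(s + 1)\,\frac{h'(s)}{h(s)} = -(s+1)\,\frac{(a-2)(1+s)^{a-3}}{(1+s)^{a-2}} = 2 - a
\quad\Longrightarrow\quad
l_1 \equiv 2 - a = A, \qquad l_2 \equiv 0 = B,
\]
\[
\tfrac{1}{2} A + B \le \frac{p-2}{4}
\iff a \ge 3 - \frac{p}{2},
\qquad
\frac{h(s)}{(s+1)^{p/2}} = (1+s)^{a - 2 - p/2} \to 0 \ \text{ as } s \to \infty \ \text{ when } a < 2 + \tfrac{p}{2}.
```

For $p = 5$ the first condition reads $a \ge 0.5$, and for $p \ge 6$ it holds for every $a \ge 0$; together with $a < 1$, which is needed for $(1+s)^{a-2}$ to be integrable and hence for the prior to be proper, this recovers the ranges $a \in [0.5, 1)$ and $a \in [0, 1)$ quoted in Example 2.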





Going far beyond these results, Theorem 2 can be used to obtain wide classes 
of proper priors that yield minimax Bayes predictive densities $\hat p_h$. Following the 
development in Section 4 of [6], such $\hat p_h$ can be obtained with particular classes 
of shifted inverted gamma priors and classes of generalized t-priors. 


4. Further extensions. Priors such as $\pi_H$ and $\pi_a$ are concentrated around 0, 
so that the risk reduction offered by $\hat p_H$ and $\hat p_a$ will be most pronounced 
when $\mu$ is close to 0. However, such priors can be readily recentered around a 
different point to obtain predictive estimators that obtain risk reduction around the 
new point. Because the superharmonicity of $m_H$ and $\sqrt{m_a}$ will be unaffected 
under such recentering, the minimaxity and domination results of Theorems 1 and 2 will 
be maintained. Minimax shrinkage toward a subspace can be similarly obtained by 
recentering such priors around the projection of $\mu$ onto the subspace. 

To vastly enlarge the region of improved performance, one can go further and 
construct analogues of the minimax multiple shrinkage estimators of George [7-9] 
that adaptively shrink toward more than one point or subspace. Such estimators 
can be obtained using mixture priors that are convex combinations of recentered 
superharmonic priors at the desired targets. Because convex combinations of 
superharmonic functions are superharmonic, Corollary 2 shows that such priors will 
lead to minimax multiple shrinkage predictive estimators. 
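The superharmonicity of such mixture priors can be probed numerically. The sketch below is an illustration of ours, not part of the paper: it forms convex combinations of recentered harmonic priors and of recentered multivariate t-type priors $(\|\mu - c\|^2 + b)^{-k}$, and approximates the Laplacian by central differences. The centers, weights, exponent `k`, offset `b`, evaluation point and step size are all hypothetical choices; in the notation of Example 1, `k` plays the role of $a_1 + p/2$.

```python
import math

def harmonic_prior(mu, center):
    """Harmonic prior recentered at `center`: ||mu - center||^{-(p-2)}."""
    p = len(mu)
    r2 = sum((a - c) ** 2 for a, c in zip(mu, center))
    return r2 ** (-(p - 2) / 2.0)

def t_prior(mu, center, k, b):
    """Recentered t-type prior (||mu - center||^2 + b)^{-k}."""
    r2 = sum((a - c) ** 2 for a, c in zip(mu, center))
    return (r2 + b) ** (-k)

def mixture(mu, weights, priors):
    """Convex combination of prior densities evaluated at mu."""
    return sum(w * f(mu) for w, f in zip(weights, priors))

def laplacian(f, mu, h=1e-3):
    """Central finite-difference approximation of the Laplacian of f at mu."""
    f0 = f(mu)
    total = 0.0
    for i in range(len(mu)):
        up = list(mu)
        dn = list(mu)
        up[i] += h
        dn[i] -= h
        total += (f(up) - 2.0 * f0 + f(dn)) / (h * h)
    return total

if __name__ == "__main__":
    centers = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
    weights = [0.4, 0.6]
    point = [1.0, 0.5, -0.3]
    harmonic_mix = lambda mu: mixture(
        mu, weights, [lambda m, c=c: harmonic_prior(m, c) for c in centers])
    t_mix = lambda mu: mixture(
        mu, weights, [lambda m, c=c: t_prior(m, c, 0.25, 1.0) for c in centers])
    print("Laplacian of harmonic mixture (approximately zero):", laplacian(harmonic_mix, point))
    print("Laplacian of t-type mixture (negative):", laplacian(t_mix, point))
```

A finite-difference check at a few points is of course only a sanity check; the general statement rests on linearity of the Laplacian, as the text notes.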


Acknowledgments. We would like to thank Andrew Barron, Larry Brown, 
John Hartigan, Bill Strawderman, Cun-Hui Zhang, an Associate Editor and anony- 
mous referees for their many generous insights and suggestions. 


REFERENCES 


[1] AITCHISON, J. (1975). Goodness of prediction fit. Biometrika 62 547-554. MR0391353 

[2] ASLAN, M. (2002). Asymptotically minimax Bayes predictive densities. Ph.D. dissertation, 
Dept. Statistics, Yale Univ. 

[3] BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, 
New York. MR0804611 

[4] BROWN, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary 
value problems. Ann. Math. Statist. 42 855-903. MR0286209 

[5] BROWN, L. D., DASGUPTA, A., HAFF, L. R. and STRAWDERMAN, W. E. (2006). The heat 
equation and Stein's identity: Connections, applications. J. Statist. Plann. Inference 136 
2254-2278. 

[6] FOURDRINIER, D., STRAWDERMAN, W. E. and WELLS, M. T. (1998). On the construction 
of Bayes minimax estimators. Ann. Statist. 26 660-671. MR1626063 

[7] GEORGE, E. I. (1986). Minimax multiple shrinkage estimation. Ann. Statist. 14 188-205. 
MR0829562 

[8] GEORGE, E. I. (1986). Combining minimax shrinkage estimators. J. Amer. Statist. Assoc. 81 
437-445. MR0845881 

[9] GEORGE, E. I. (1986). A formal Bayes multiple shrinkage estimator. Comm. Statist. A 
Theory Methods 15 2099-2114. MR0851859 

[10] HARRIS, I. R. (1989). Predictive fit for natural exponential families. Biometrika 76 675-684. 
MR1041412 

[11] HARTIGAN, J. A. (1998). The maximum likelihood prior. Ann. Statist. 26 2083-2103. 
MR1700222 

[12] KOMAKI, F. (1996). On asymptotic properties of predictive distributions. Biometrika 83 
299-313. MR1439785 

[13] KOMAKI, F. (2001). A shrinkage predictive distribution for multivariate normal observables. 
Biometrika 88 859-864. MR1859415 

[14] KOMAKI, F. (2004). Simultaneous prediction of independent Poisson observables. Ann. Statist. 
32 1744-1769. MR2089141 

[15] LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation, 2nd ed. Springer, 
New York. MR1639875 

[16] LIANG, F. (2002). Exact minimax procedures for predictive density estimation and data 
compression. Ph.D. dissertation, Dept. Statistics, Yale Univ. 

[17] LIANG, F. and BARRON, A. (2004). Exact minimax strategies for predictive density estimation, 
data compression and model selection. IEEE Trans. Inform. Theory 50 2708-2726. 
MR2096988 

[18] MURRAY, G. D. (1977). A note on the estimation of probability density functions. Biometrika 
64 150-152. MR0448690 

[19] NG, V. M. (1980). On the estimation of parametric density functions. Biometrika 67 505-506. 
MR0581751 

[20] STEELE, J. M. (2001). Stochastic Calculus and Financial Applications. Springer, New York. 
MR1783083 

[21] STEIN, C. (1974). Estimation of the mean of a multivariate normal distribution. In Proc. Prague 
Symposium on Asymptotic Statistics (J. Hajek, ed.) 2 345-381. Univ. Karlova, Prague. 
MR0381062 

[22] STEIN, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 
1135-1151. MR0630098 

[23] STRAWDERMAN, W. E. (1971). Proper Bayes minimax estimators of the multivariate normal 
mean. Ann. Math. Statist. 42 385-388. MR0397939 

[24] SWEETING, T. J., DATTA, G. S. and GHOSH, M. (2006). Nonsubjective priors via predictive 
relative entropy regret. Ann. Statist. 34 441-468. 


E. I. GEORGE 
DEPARTMENT OF STATISTICS 
THE WHARTON SCHOOL 
UNIVERSITY OF PENNSYLVANIA 
PHILADELPHIA, PENNSYLVANIA 19104-6340 
USA 
E-MAIL: edgeorge@wharton.upenn.edu 

F. LIANG 
INSTITUTE OF STATISTICS 
AND DECISION SCIENCES 
DUKE UNIVERSITY 
DURHAM, NORTH CAROLINA 27708-0251 
USA 
E-MAIL: feng@isds.duke.edu 

X. XU 
DEPARTMENT OF STATISTICS 
OHIO STATE UNIVERSITY 
COLUMBUS, OHIO 43210-1247 
USA 
E-MAIL: xinyi@stat.ohio-state.edu 


The Annals of Statistics 

2006, Vol. 34, No. 1, 92-122 

DOI: 10.1214/009053605000000859 

© Institute of Mathematical Statistics, 2006 


SEQUENTIAL CHANGE-POINT DETECTION WHEN 
UNKNOWN PARAMETERS ARE PRESENT IN 
THE PRE-CHANGE DISTRIBUTION¹ 


By YAJUN MEI 
California Institute of Technology and Fred Hutchinson Cancer Research Center 


In the sequential change-point detection literature, most research specifies 
a required frequency of false alarms at a given pre-change distribution $f_\theta$ 
and tries to minimize the detection delay for every possible post-change 
distribution $g_\lambda$. In this paper, motivated by a number of practical examples, we 
first consider the reverse question by specifying a required detection delay 
at a given post-change distribution and trying to minimize the frequency of 
false alarms for every possible pre-change distribution $f_\theta$. We present 
asymptotically optimal procedures for one-parameter exponential families. Next, 
we develop a general theory for change-point problems when both the pre-change 
distribution $f_\theta$ and the post-change distribution $g_\lambda$ involve unknown 
parameters. We also apply our approach to the special case of detecting shifts 
in the mean of independent normal observations. 


1. Introduction. Suppose there is a process that produces a sequence of 
independent observations $X_1, X_2, \ldots$. Initially the process is “in control” and 
the true distribution of the $X$'s is $f_\theta$ for some $\theta \in \Theta$. At some 
unknown time $\nu$, the process goes “out of control" in the sense that the distribution 
of $X_\nu, X_{\nu+1}, \ldots$ is $g_\lambda$ for some $\lambda \in \Lambda$. It is desirable 
to raise an alarm as soon as the process is out of control so that we can take 
appropriate action. This is known as a change-point problem, or quickest change 
detection problem. By analogy with hypothesis testing terminology [12], we will refer to 
$\Theta$ ($\Lambda$) as a "simple" pre-change (post-change) hypothesis if it contains a 
single point and as a "composite" pre-change (post-change) hypothesis if it contains 
more than one point. 

The change-point problem originally arose from statistical quality control, and 
now it has many other important applications, including reliability, fault detection, 
finance, signal detection, surveillance and security systems. Extensive research has 
been done in this field during the last few decades. For recent reviews, we refer 
readers to [1, 9] and the references therein. 

In the simplest case where both $\Theta$ and $\Lambda$ are simple, that is, the pre-change 
distribution $f_\theta$ and the post-change distribution $g_\lambda$ are completely 
specified, the 

Received February 2004, revised February 2005. 
¹Supported in part by NIH Grant R01 AI055343. 
AMS 2000 subject classifications. Primary 62L10, 62L15; secondary 62F05. 
Key words and phrases. Asymptotic optimality, change-point, optimizer, power one tests, quality 
control, statistical process control, surveillance. 




problem is well understood and has been solved under a variety of criteria. Some 
popular schemes are Shewhart's control charts, moving average control charts, 
Page's CUSUM procedure and the Shiryayev-Roberts procedure; see [1, 17, 24-26]. 
The first asymptotic theory, using a minimax approach, was provided in [14]. 

In practice, the assumption of known pre-change distribution $f_\theta$ and 
post-change distribution $g_\lambda$ is too restrictive. Motivated by applications in 
statistical quality control, the standard formulation of a more flexible model assumes 
that $\Theta$ is simple and $\Lambda$ is composite, that is, $f_\theta$ is completely 
specified and the post-change distribution $g_\lambda$ involves an unknown parameter 
$\lambda$. See, for example, [9-11, 14, 20, 21, 29]. When the true $\theta$ of the 
pre-change distribution $f_\theta$ is unknown, it is typical to assume that a training 
sample is available so that one can use the method of “point estimation" to obtain a 
value $\theta_0$. However, it is well known that the performances of such procedures are 
very sensitive to the error in estimating $\theta$; see, for example, [30]. Thus we need 
to study change-point problems for composite pre-change hypotheses, which allow a range 
of "acceptable" values of $\theta$. 

There are a few papers in the literature that use a parametric approach to deal 
with the case when the pre-change distribution involves unknown parameters (see, 
e.g., [6, 8, 22, 33, 34]), but all assume the availability of a training sample and/or 
the existence of an invariant structure. In this paper, we make no such assumptions. 
Our approach is motivated by the following examples. 


EXAMPLE 1.1 (Water quality). Suppose we are interested in monitoring a 
contaminant, say antimony, in drinking water. Because of its potential health ef- 
fects, the U.S. Environmental Protection Agency (EPA) sets a maximum contam- 
inant level goal (MCLG) and a maximum contaminant level (MCL). An MCLG 
is a nonenforceable but desirable health-related goal established at the level where 
there is no known or expected risk to health. An MCL is the enforceable limit 
set as close to the MCLG as possible. For antimony, both MCL and MCLG are 
0.006 mg/L. Thus the water quality is "in control" as long as the level of the 
contaminant is less than MCLG, and we should take prompt action if the level 
exceeds MCL. 


EXAMPLE 1.2 (Public health surveillance). Consider the surveillance of the 
incidence of rare health events. If the underlying disease rate is greater than some
specified level, we want to detect it quickly so as to enable early intervention from 
a public health point of view and to avoid a much greater tragedy. Otherwise, the 
disease is "in control.” 


EXAMPLE 1.3 (Change in variability). In statistical process control, some- 
times one is concerned about possible changes in the variance. When the value 
of the variance is greater than some pre-specified constant, the process should be 
stopped and declared “out of control.” However, when the process is in control, 
there typically is no unique target value for the variance, which should be as small 
as the process permits. 


94 Y. MEI 


EXAMPLE 1.4 (Signal disappearance). Suppose that one is monitoring or 
tracking a weak signal in a noisy environment. If the signal disappears, one wants 
to detect the disappearance as quickly as possible. Parameters $\theta$ associated with the
signal, for example, its strength, are described by a composite hypothesis before it 
disappears, but by a simple hypothesis (strength equal to zero) afterward. 


The essential feature of these examples is that the need to take action in response
to a change in a parameter $\theta$ can be defined by a fixed threshold value. This
inspires us to study change-point problems where $\Theta$ is composite and $\Lambda$ is simple.
Unlike the standard formulation which specifies a required frequency of false
alarms, our formulation specifies a required detection delay and seeks to minimize
the frequency of false alarms for all possible pre-change distributions $f_\theta$. Section 2
uses this formulation to study the problem of detecting a change of the parameter
value in a one-parameter exponential family. It is worthwhile pointing out that
the generalized likelihood ratio method does not provide asymptotically optimal
procedures under our formulation.

It is natural to combine the standard formulation with our formulation by
considering change-point problems when both $\Theta$ and $\Lambda$ are composite, that is, both
the pre-change distribution and the post-change distribution involve unknown
parameters. Ideally we want to optimize all possible false alarm rates and all possible
detection delays. Unfortunately this cannot be done, and there is no attractive
definition of optimality in the literature for this problem. In Section 3, we propose
a useful definition of "asymptotically optimal to first order" procedures, thereby
generalizing Lorden's asymptotic theory, and develop such procedures with the
idea of "optimizer."

This paper is organized as follows. In the remainder of this section we provide 
some notation and definitions based on the classical results for the change-point 
problem when both $\Theta$ and $\Lambda$ are simple. Section 2 establishes the asymptotic
optimality of our proposed procedures for the problem of detecting a change of the
parameter value in a one-parameter exponential family, and Section 3 develops an 
asymptotic theory for change-point problems when both the pre-change distribu- 
tion and the post-change distribution involve unknown parameters. Both Sections 
2 and 3 contain some numerical simulations. Section 4 illustrates the application 
of our general theory to the problem of detecting shifts in the mean of independent 
normal observations. Section 5 contains the proof of Theorem 2.1. 

Denote by $P^{(\nu)}_{\theta,\lambda}$ and $E^{(\nu)}_{\theta,\lambda}$ the probability measure and expectation, respectively,
when $X_1, \ldots, X_{\nu-1}$ are distributed according to a pre-change distribution $f_\theta$ for
some $\theta \in \Theta$ and $X_\nu, X_{\nu+1}, \ldots$ are distributed according to a post-change
distribution $g_\lambda$ for some $\lambda \in \Lambda$. We shall also use $P_\theta$ and $E_\theta$ to denote the probability
measure and expectation, respectively, under which $X_1, X_2, \ldots$ are independent
and identically distributed with density $f_\theta$ (corresponding to $\nu = \infty$). In
change-point problems, a procedure for detecting that a change has occurred is defined as
a stopping time $N$ with respect to $\{X_n\}_{n \ge 1}$. The interpretation of $N$ is that, when
$N = n$, we stop at $n$ and declare that a change has occurred somewhere in the
first $n$ observations. The performance of $N$ is evaluated by two criteria: the long
and short average run lengths (ARL). The long ARL is defined by $E_\theta N$. Imagining
repeated applications of such procedures, practitioners refer to the frequency
of false alarms as $1/E_\theta N$ and the mean time between false alarms as $E_\theta N$. The
short ARL can be defined by the following worst case detection delay, proposed
by Lorden [14]:

$$\bar{E}_\lambda N = \sup_{\nu \ge 1} \Big( \operatorname{ess\,sup}\, E^{(\nu)}_{\theta,\lambda}\big[(N - \nu + 1)^+ \mid X_1, \ldots, X_{\nu-1}\big] \Big).$$

Note that the definition of $\bar{E}_\lambda N$ does not depend upon the pre-change
distribution $f_\theta$ by virtue of the essential supremum, which takes the "worst possible X's
before the change." In our theorems we can also use the average detection delay,
proposed by Shiryayev [25] and Pollak [19], $\sup_{\theta \in \Theta} \big( \sup_{\nu \ge 1} E^{(\nu)}_{\theta,\lambda}(N - \nu \mid N \ge \nu) \big)$,
which is asymptotically equivalent to $\bar{E}_\lambda N$.

If $\Theta$ and $\Lambda$ are simple, say $\Theta = \{\theta\}$ and $\Lambda = \{\lambda\}$, Page's CUSUM procedure is
defined by

$$T_{CM}(\theta, a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \sum_{i=k}^{n} \log \frac{g_\lambda(X_i)}{f_\theta(X_i)} \ge a \Big\}, \tag{1.1}$$

where the notation is used to emphasize that the pre-change distribution is $f_\theta$.
Moustakides [16] and Ritov [23] showed that Page's CUSUM procedure $T_{CM}(\theta, a)$
is exactly optimal in the following minimax sense: For any $a > 0$, $T_{CM}(\theta, a)$
minimizes $\bar{E}_\lambda N$ among all stopping times $N$ satisfying $E_\theta N \ge E_\theta T_{CM}(\theta, a)$. Earlier
Lorden [14] proved this property holds asymptotically. Specifically, Lorden [14]
showed that for each pair $(\theta, \lambda)$

$$\bar{E}_\lambda N \ge (1 + o(1)) \frac{\log E_\theta N}{I(\lambda, \theta)} \tag{1.2}$$

as $E_\theta N \to \infty$, and $T_{CM}(\theta, a)$ attains the lower bound asymptotically. Here
$I(\lambda, \theta) = E_\lambda \log(g_\lambda(X)/f_\theta(X))$ is the Kullback-Leibler information number. This
suggests defining the asymptotic efficiency of a family $\{N(a)\}$ as

$$e(\theta, \lambda) = \liminf_{a \to \infty} \frac{\log E_\theta N(a)}{I(\lambda, \theta)\, \bar{E}_\lambda N(a)}, \tag{1.3}$$

where $\{N(a)\}$ is required to satisfy $E_\theta N(a) \to \infty$ as $a \to \infty$. Then $e(\theta, \lambda) \le 1$
for all families, so we can define:


DEFINITION 1.1. A family of stopping times $\{N(a)\}$ is asymptotically
efficient at $(\theta, \lambda)$ if $e(\theta, \lambda) = 1$.

It follows that Page's CUSUM procedure $T_{CM}(\theta, a)$ for detecting a change in
distribution from $f_\theta$ to $g_\lambda$ is asymptotically efficient at $(\theta, \lambda)$. However, $T_{CM}(\theta, a)$
in general will not be asymptotically efficient at $(\theta', \lambda)$ if $\theta' \ne \theta$; see Section 2.4
in [31], equation (2.57) in [28] and Table 1 in [5].
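As an illustration of definition (1.1), the following minimal sketch computes Page's CUSUM stopping time on a finite sample path through the standard recursion $W_n = \max(W_{n-1}, 0) + \log(g_\lambda(X_n)/f_\theta(X_n))$ with $W_0 = 0$. The function names and the choice of unit-variance normal densities are assumptions made here for illustration only.

```python
def normal_log_lr(theta, lam):
    """Log-likelihood ratio log f_lam(x)/f_theta(x) for unit-variance normals.

    Assumption for illustration: f_xi is the N(xi, 1) density.
    """
    return lambda x: (lam - theta) * x - (lam ** 2 - theta ** 2) / 2.0


def cusum_stopping_time(xs, log_lr, a):
    """First n (1-based) with max_{1<=k<=n} sum_{i=k}^{n} log_lr(x_i) >= a.

    Returns None if the rule does not stop within the observed sample.
    """
    w = 0.0  # W_0 = 0
    for n, x in enumerate(xs, start=1):
        # W_n = max(W_{n-1}, 0) + log_lr(x_n) equals the maximum over k of the
        # partial sums from k to n, so past observations need not be stored.
        w = max(w, 0.0) + log_lr(x)
        if w >= a:
            return n
    return None


if __name__ == "__main__":
    llr = normal_log_lr(theta=-1.0, lam=0.0)  # here log_lr(x) = x + 0.5
    print(cusum_stopping_time([-1.0, -1.0, 0.5, 1.0, 2.0], llr, a=2.0))
```

The recursion keeps only one number per stage, which is why CUSUM serves as the computational baseline for the procedures discussed later.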


2. Simple post-change hypotheses. It will be assumed in this section and
only in this section that $f_\theta$ and $g_\lambda = f_\lambda$ belong to a one-parameter exponential
family

$$f_\xi(x) = \exp(\xi x - b(\xi)), \qquad -\infty < x < \infty,\ \xi \in \Omega, \tag{2.1}$$

with natural parameter space $\Omega = (\underline{\xi}, \bar{\xi})$ with respect to a $\sigma$-finite measure $F$.
Then $b(\xi)$ is strictly convex on $\Omega$. Assume that $\Theta = [\theta_0, \theta_1]$ is a subset of $\Omega$, and
$\lambda$ is a given value outside the interval $\Theta$, say $\lambda > \theta_1$. In this section we consider
the problem of detecting a change in distribution from $f_\theta$ for some $\theta \in \Theta$ to $f_\lambda$
and we want to find a stopping time $N$ such that $E_\theta N$ is as large as possible for
each $\theta \in \Theta = [\theta_0, \theta_1]$ subject to the constraint

$$\bar{E}_\lambda N \le \gamma, \tag{2.2}$$

where $\gamma > 0$ is a given constant and $\lambda \notin \Theta$.

One cannot simultaneously maximize $E_\theta N$ for all $\theta \in \Theta$ subject to (2.2)
since the maximum for each $\theta$ is uniquely attained by Page's CUSUM procedure
$T_{CM}(\theta, a)$ in (1.1). As one referee pointed out, if one wants to maximize
$\inf_{\theta \in \Theta} E_\theta N$ subject to (2.2), then the exactly optimal solution is Page's CUSUM
procedure $T_{CM}(\theta_1, a)$ for detecting a change in distribution from $f_{\theta_1}$ to $f_\lambda$. This
is because $\inf_{\theta \in \Theta} E_\theta N \le E_{\theta_1} N$ with equality holding for $N = T_{CM}(\theta_1, a)$, which
maximizes $E_{\theta_1} N$ among all stopping times $N$ satisfying $\bar{E}_\lambda N \le \bar{E}_\lambda T_{CM}(\theta_1, a)$. In
other words, this setup is equivalent to the simplest problem of detecting a change
in distribution from $f_{\theta_1}$ to $f_\lambda$.

In this section, rather than be satisfied with just $\inf_{\theta \in \Theta} E_\theta N$, a lower bound
on $E_\theta N$ over $\theta \in \Theta$, we want to maximize $E_\theta N$ asymptotically for each $\theta \in \Theta$ as
$\gamma \to \infty$, or equivalently, to find a family of stopping times that is asymptotically
efficient at $(\theta, \lambda)$ for every $\theta \in \Theta = [\theta_0, \theta_1]$.

Before studying change-point problems in Section 2.2, we first consider the cor- 
responding open-ended hypothesis testing problems in Section 2.1, since the basic 
arguments are clearer for hypothesis testing problems and are readily extendable 
to change-point problems. 


2.1. Open-ended hypothesis testing. Suppose $X_1, X_2, \ldots$ are independent
and identically distributed random variables with probability density $f_\xi$ of the
form (2.1) on the natural parameter space $\Omega = (\underline{\xi}, \bar{\xi})$. Suppose we are interested
in testing the null hypothesis

$$H_0 : \xi \in \Theta = [\theta_0, \theta_1]$$

against the alternative hypothesis

$$H_1 : \xi \in \Lambda = \{\lambda\},$$

where $\underline{\xi} < \theta_0 < \theta_1 < \lambda < \bar{\xi}$.

Motivated by applications to change-point problems, we consider the following
open-ended hypothesis testing problems. Assume that if $H_0$ is true, sampling costs
nothing and our preferred action is just to observe $X_1, X_2, \ldots$ without stopping.
On the other hand, if $H_1$ is true, each observation costs a fixed amount and we
want to stop sampling as soon as possible and reject the null hypothesis $H_0$.

Since there is only one terminal decision, a statistical procedure for an
open-ended hypothesis testing problem is defined by a stopping time $N$. The null
hypothesis $H_0$ is rejected if and only if $N < \infty$. A good procedure $N$ should keep the
error probabilities $P_\theta(N < \infty)$ small for every $\theta \in \Theta$ while keeping $E_\lambda N$ small.

The problem in this subsection is to find a stopping time $N$ such that
$P_\theta(N < \infty)$ will be as small as possible for every $\theta \in \Theta = [\theta_0, \theta_1]$ subject to
the constraint

$$E_\lambda N \le \gamma, \tag{2.3}$$

where $\gamma > 0$ is a given constant.

For each $\theta \in \Theta$, by [32], the minimum of $P_\theta(N < \infty)$ is uniquely attained
by the one-sided sequential probability ratio test (SPRT) of $H_{0,\theta} : \xi = \theta$ versus
$H_1 : \xi = \lambda$, which is given by

$$\inf\Big\{ n \ge 1 : \sum_{i=1}^{n} \log \frac{f_\lambda(X_i)}{f_\theta(X_i)} \ge C_\theta \Big\}.$$

In order to satisfy (2.3), it is well known that $C_\theta \approx I(\lambda, \theta)\gamma$; see, for example,
page 26 of [28]. A simple observation is that the null hypothesis is expressed as a
union of the individual null hypotheses, $H_{0,\theta} : \xi = \theta$, and so the intersection-union
method (see [2]) suggests considering the stopping time

$$M(a) = \inf\Big\{ n \ge 1 : \sum_{i=1}^{n} \log \frac{f_\lambda(X_i)}{f_\theta(X_i)} \ge I(\lambda, \theta)a \ \text{ for all } \theta_0 \le \theta \le \theta_1 \Big\}. \tag{2.4}$$

The rationale is that $H_0$ can be rejected only if each of the individual null
hypotheses $H_{0,\theta} : \xi = \theta$ can be rejected.
In order to study the behavior of $M(a)$, it is useful to express $M(a)$ in terms of
$S_n = X_1 + \cdots + X_n$. Define

$$\phi(\theta) = \frac{b(\lambda) - b(\theta)}{\lambda - \theta}. \tag{2.5}$$

Then by (2.1), the stopping time $M(a)$ can be written as

$$M(a) = \inf\Big\{ n \ge 1 : S_n \ge b'(\lambda)a + \sup_{\theta_0 \le \theta \le \theta_1} \big[(n - a)\phi(\theta)\big] \Big\} \tag{2.6}$$

because $\lambda > \theta_1$. Now $\phi(\theta)$ is an increasing function since $b(\theta)$ is convex, thus the
supremum in (2.6) is attained at $\theta = \theta_0$ if $n \le a$, and at $\theta = \theta_1$ if $n > a$. Therefore,
$M(a)$ is equivalent to the simpler test which uses two simultaneous SPRTs (with
appropriate boundaries), one for each of the individual null hypotheses $\theta_0$, $\theta_1$. This
fact makes it convenient for theoretical analysis and numerical simulations.
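The reduction of (2.4) to the two-boundary form (2.6) can be checked numerically. The sketch below assumes, for illustration only, the unit-variance normal family, for which $b(\xi) = \xi^2/2$, $b'(\lambda) = \lambda$, $\phi(\theta) = (\lambda + \theta)/2$ and $I(\lambda, \theta) = (\lambda - \theta)^2/2$. It compares the boundary form with a direct evaluation of (2.4) on a grid of $\theta$ values; all function names are hypothetical.

```python
import random


def phi(theta, lam):
    """phi(theta) of (2.5) for the normal family, where b(x) = x^2 / 2."""
    return (lam + theta) / 2.0


def open_ended_m(xs, a, theta0, theta1, lam):
    """M(a) via (2.6): S_n >= b'(lam) a + (n - a) phi(theta*), with theta*
    equal to theta0 when n <= a and theta1 when n > a."""
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s += x
        theta_star = theta0 if n <= a else theta1
        if s >= lam * a + (n - a) * phi(theta_star, lam):
            return n
    return None


def open_ended_m_grid(xs, a, theta0, theta1, lam, m=200):
    """M(a) via (2.4), checking every theta on a grid that includes both endpoints."""
    thetas = [theta0 + (theta1 - theta0) * j / m for j in range(m + 1)]
    s = 0.0
    for n, x in enumerate(xs, start=1):
        s += x
        if all((lam - t) * s - n * (lam ** 2 - t ** 2) / 2.0
               >= (lam - t) ** 2 / 2.0 * a for t in thetas):
            return n
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    path = [rng.gauss(0.0, 1.0) for _ in range(200)]  # data generated under H_1
    print(open_ended_m(path, 5.0, -1.0, -0.5, 0.0),
          open_ended_m_grid(path, 5.0, -1.0, -0.5, 0.0))
```

Because the stopping condition is linear in $\phi(\theta)$, only the two endpoints matter, so the boundary form needs constant work per observation while the grid version is kept purely as a cross-check.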

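The following theorem, whose proof is given in Section 5, establishes the
asymptotic properties of $M(a)$ for large $a$.

THEOREM 2.1. For any $a > 0$ and all $\theta_0 \le \theta \le \theta_1$,

$$\frac{|\log P_\theta(M(a) < \infty)|}{I(\lambda, \theta)} \ge a, \tag{2.7}$$

and as $a \to \infty$

$$E_\lambda M(a) = a + (C + o(1))\sqrt{a}, \tag{2.8}$$

where

$$C = \Big( \frac{\lambda - \theta_1}{I(\lambda, \theta_1)} - \frac{\lambda - \theta_0}{I(\lambda, \theta_0)} \Big) \sqrt{\frac{b''(\lambda)}{2\pi}} > 0. \tag{2.9}$$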

The following corollary establishes the asymptotic optimality of $M(a)$.

COROLLARY 2.1. Suppose $\{N(a)\}$ is a family of stopping times such that
$E_\lambda N(a) \le E_\lambda M(a)$. For all $\theta_0 \le \theta \le \theta_1$, as $a \to \infty$,

$$\frac{|\log P_\theta(N(a) < \infty)|}{I(\lambda, \theta)} \le a + (C + o(1))\sqrt{a},$$

where $C$ is as defined in (2.9). Thus $M(a)$ asymptotically minimizes the error
probabilities $P_\theta(N < \infty)$ for every $\theta \in \Theta = [\theta_0, \theta_1]$ among all stopping times $N$
such that $E_\lambda N \le E_\lambda M(a)$.

PROOF. The corollary follows directly from Theorem 2.1 and the well-known
fact that

$$\frac{|\log P_\theta(N(a) < \infty)|}{I(\lambda, \theta)} \le E_\lambda N(a)$$

for all $\theta \in [\theta_0, \theta_1]$. □


2.2. Change-point problems. Now let us consider the problem of detecting
a change in distribution from $f_\theta$ for some $\theta \in \Theta = [\theta_0, \theta_1]$ to $f_\lambda$. As described
earlier, we seek a family of stopping times that is asymptotically efficient at $(\theta, \lambda)$
for every $\theta \in \Theta$.

A method for finding such a family is suggested by the following result, which 
indicates the relationship between open-ended hypothesis testing and change-point 
problems. 


LEMMA 2.1 (Lorden [14]). Let $N$ be a stopping time with respect to
$X_1, X_2, \ldots$. For $k = 1, 2, \ldots$, let $N_k$ denote the stopping time obtained by
applying $N$ to $X_k, X_{k+1}, \ldots$, and define

$$N^* = \min_{k \ge 1}(N_k + k - 1).$$

Then $N^*$ is a stopping time with

$$E_\theta N^* \ge 1/P_\theta(N < \infty) \quad \text{and} \quad \bar{E}_\lambda N^* \le E_\lambda N$$

for any $\theta$ and $\lambda$.
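The construction in Lemma 2.1 can be sketched generically: given any open-ended stopping rule, restart it at every time $k$ and stop at the earliest alarm. The helper names below are hypothetical. The one-sided SPRT used in the demonstration is only an example, chosen because restarting it at every $k$ is known to reproduce Page's CUSUM rule (1.1).

```python
def lorden_transform(xs, stop_rule):
    """N* = min_{k>=1} (N_k + k - 1), where N_k is stop_rule applied to
    x_k, x_{k+1}, ... (1-based). Returns None if no restart stops in-sample."""
    best = None
    for k in range(1, len(xs) + 1):
        n_k = stop_rule(xs[k - 1:])
        if n_k is not None:
            candidate = n_k + k - 1
            if best is None or candidate < best:
                best = candidate
    return best


def one_sided_sprt(log_lr, a):
    """Open-ended test: stop the first time the cumulative log-LR reaches a."""
    def rule(xs):
        total = 0.0
        for n, x in enumerate(xs, start=1):
            total += log_lr(x)
            if total >= a:
                return n
        return None
    return rule


if __name__ == "__main__":
    # Illustrative log-LR for N(-1, 1) versus N(0, 1): log_lr(x) = x + 0.5.
    rule = one_sided_sprt(lambda x: x + 0.5, a=2.0)
    print(lorden_transform([-1.0, -1.0, 0.5, 1.0, 2.0], rule))
```

This brute-force version reruns the rule from every starting point and is meant only to mirror the definition; recursive forms are preferable in practice.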
Let $M(a)$ be the stopping time defined in (2.4), and let $M_k(a)$ be the stopping
time obtained by applying $M(a)$ to the observations $X_k, X_{k+1}, \ldots$. Define a new
stopping time by $M^*(a) = \min_{k \ge 1}(M_k(a) + k - 1)$. In other words,

$$M^*(a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \inf_{\theta_0 \le \theta \le \theta_1} \Big[ \sum_{i=k}^{n} \log \frac{f_\lambda(X_i)}{f_\theta(X_i)} - I(\lambda, \theta)a \Big] \ge 0 \Big\}. \tag{2.10}$$

The next theorem establishes the asymptotic performance of $M^*(a)$, which
immediately implies that the family $\{M^*(a)\}$ is asymptotically efficient at $(\theta, \lambda)$ for
every $\theta \in \Theta$.





THEOREM 2.2. For any $a > 0$ and $\theta_0 \le \theta \le \theta_1$,

$$E_\theta M^*(a) \ge \exp(I(\lambda, \theta)a), \tag{2.11}$$

and as $a \to \infty$,

$$\bar{E}_\lambda M^*(a) \le a + (C + o(1))\sqrt{a}, \tag{2.12}$$

where $C$ is as defined in (2.9). Moreover, if $\{N(a)\}$ is a family of stopping times
such that (2.11) holds for some $\theta$ with $N(a)$ replacing $M^*(a)$, then

$$\bar{E}_\lambda N(a) \ge a + O(1) \quad \text{as } a \to \infty. \tag{2.13}$$

PROOF. Relations (2.11) and (2.12) follow at once from Theorem 2.1 and
Lemma 2.1. Relation (2.13) follows from the following proposition, which
improves Lorden's lower bound in (1.2). □


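PROPOSITION 2.1. Given $\theta$ and $\lambda \ne \theta$, there exists an $M = M(\theta, \lambda) > 0$ such
that for any stopping time $N$,

$$\log E_\theta N \le I(\lambda, \theta)\bar{E}_\lambda N + M. \tag{2.14}$$

PROOF. By equation (2.53) on page 26 of [28], there exist $C_1$ and $C_2$ such
that for Page's CUSUM procedure $T_{CM}(\theta, a)$ in (1.1),

$$E_\theta T_{CM}(\theta, a) \le C_1 e^a \quad \text{and} \quad I(\lambda, \theta)\bar{E}_\lambda T_{CM}(\theta, a) \ge a - C_2$$

for all $a > 0$. For any given stopping time $N$, choose $a = \log E_\theta N - \log C_1$; then
$E_\theta N = C_1 e^a \ge E_\theta T_{CM}(\theta, a)$. The optimality property of $T_{CM}(\theta, a)$ [16] implies
that

$$I(\lambda, \theta)\bar{E}_\lambda N \ge I(\lambda, \theta)\bar{E}_\lambda T_{CM}(\theta, a) \ge a - C_2 = \log E_\theta N - \log C_1 - C_2,$$

so that (2.14) holds with $M = \log C_1 + C_2$. □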


The following corollary follows at once from Theorem 2.2. 


COROLLARY 2.2. Suppose $\{N(a)\}$ is a family of stopping times such that

$$\bar{E}_\lambda N(a) \le \bar{E}_\lambda M^*(a).$$

Then for all $\theta_0 \le \theta \le \theta_1$, as $a \to \infty$,

$$\frac{\log E_\theta N(a)}{I(\lambda, \theta)} \le a + (C + o(1))\sqrt{a},$$

where $C$ is as defined in (2.9). Thus, as $a \to \infty$, $M^*(a)$ asymptotically maximizes
$\log E_\theta N$ [up to $O(\sqrt{a})$] for every $\theta \in [\theta_0, \theta_1]$ among all stopping times $N$ such
that $\bar{E}_\lambda N \le \bar{E}_\lambda M^*(a)$.

REMARK. The $O(\sqrt{a})$ terms are the price one must pay for optimality at
every pre-change distribution $f_\theta$.


In order to implement the stopping times $M^*(a)$ numerically, using (2.6), we
can express $M^*(a)$ in the following convenient form:

$$M^*(a) = \inf\Big\{ n \ge 1 : \max_{n-b+1 \le k \le n} \sum_{i=k}^{n} \log \frac{f_\lambda(X_i)}{f_{\theta_0}(X_i)} \ge I(\lambda, \theta_0)a \ \text{ or } \ W_{n-b} + \sum_{i=n-b+1}^{n} \log \frac{f_\lambda(X_i)}{f_{\theta_1}(X_i)} \ge I(\lambda, \theta_1)a \Big\}, \tag{2.15}$$

where $b = [a]$, $W_k = \max\{W_{k-1}, 0\} + \log(f_\lambda(X_k)/f_{\theta_1}(X_k))$ and $W_0 = 0$. Since
$W_k$ can be calculated recursively, this form reduces the memory requirements at
every stage $n$ from the full data set $\{X_1, \ldots, X_n\}$ to the data set of size $b + 1$,
that is, $\{X_{n-b}, X_{n-b+1}, \ldots, X_n\}$. It is easy to see that this form involves only $O(a)$
computations at every stage $n$.
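A minimal sketch of the windowed form (2.15) follows, checked against a brute-force evaluation of (2.10). It assumes the unit-variance normal family for illustration, reads $b = [a]$ as the integer part of $a$, and only tests the second condition of (2.15) once $n - b \ge 1$. The function names are hypothetical.

```python
import math
import random
from collections import deque


def llr(theta, lam, x):
    """log f_lam(x)/f_theta(x) for unit-variance normals (illustrative family)."""
    return (lam - theta) * x - (lam ** 2 - theta ** 2) / 2.0


def kl(lam, theta):
    """Kullback-Leibler number I(lam, theta) for unit-variance normals."""
    return (lam - theta) ** 2 / 2.0


def m_star_windowed(xs, a, theta0, theta1, lam):
    """M*(a) via (2.15): keep the last b observations plus the lagged W_{n-b}."""
    b = math.floor(a)
    window = deque()   # holds x_{n-b+1}, ..., x_n
    w_lag = None       # W_{n-b}; undefined until n - b >= 1
    for n, x in enumerate(xs, start=1):
        window.append(x)
        if len(window) > b:
            oldest = window.popleft()  # this is x_{n-b}
            prev = 0.0 if w_lag is None else max(w_lag, 0.0)
            w_lag = prev + llr(theta1, lam, oldest)
        # Windows of length <= b: the theta0 boundary is the binding one.
        s0, best0 = 0.0, float("-inf")
        for xi in reversed(window):
            s0 += llr(theta0, lam, xi)
            best0 = max(best0, s0)
        if best0 >= kl(lam, theta0) * a:
            return n
        # Windows of length > b: the theta1 boundary is the binding one.
        if w_lag is not None:
            s1 = sum(llr(theta1, lam, xi) for xi in window)
            if w_lag + s1 >= kl(lam, theta1) * a:
                return n
    return None


def m_star_bruteforce(xs, a, theta0, theta1, lam, grid=10):
    """M*(a) via (2.10), with the infimum taken over a theta grid with endpoints."""
    thetas = [theta0 + (theta1 - theta0) * j / grid for j in range(grid + 1)]
    for n in range(1, len(xs) + 1):
        for k in range(1, n + 1):
            seg = xs[k - 1:n]
            if all(sum(llr(t, lam, x) for x in seg) >= kl(lam, t) * a for t in thetas):
                return n
    return None


def simulate_path(seed, theta, lam, change_at, length):
    """Path with mean theta for the first change_at observations, then mean lam."""
    rng = random.Random(seed)
    return [rng.gauss(theta if i < change_at else lam, 1.0) for i in range(length)]
```

The windowed version stores at most $b$ observations and one running statistic, in line with the $O(a)$ cost per stage noted above; the brute-force version is kept only as a reference implementation for checking.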

As an Associate Editor noted, there are other procedures that can have the same
asymptotic optimality properties as $M^*(a)$. For example, if we define a slightly
different procedure $M_1^*(a)$ by switching $\inf_{\theta_0 \le \theta \le \theta_1}$ with $\max_{1 \le k \le n}$ in the
definition of $M^*(a)$ in (2.10), or if we define $M_2^*(a) = \sup_{\theta_0 \le \theta \le \theta_1} \{T_{CM}(\theta, I(\lambda, \theta)a)\}$,
where $T_{CM}(\theta, I(\lambda, \theta)a)$ is Page's CUSUM procedure for detecting a change in
distribution from $f_\theta$ to $f_\lambda$ with log-likelihood ratio boundary $I(\lambda, \theta)a$, then both
$M_1^*(a)$ and $M_2^*(a)$ are well-defined stopping times that are asymptotically
efficient at $(\theta, \lambda)$ for every $\theta \in \Theta$. However, both $M_1^*(a)$ and $M_2^*(a)$ are difficult to
implement, although one can easily implement their approximations which replace
$\Theta = [\theta_0, \theta_1]$ by a (properly chosen) finite subset of $\Theta$.

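It is important to emphasize that in all the above procedures we should choose
appropriate stopping boundaries. Otherwise the procedures may not be asymptotically
efficient at every $\theta \in \Theta$. For instance, motivated by the generalized likelihood
ratio method, one may want to use the procedure

$$T'(a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \log \frac{f_\lambda(X_k) \cdots f_\lambda(X_n)}{\sup_{\theta_0 \le \theta \le \theta_1} \big( f_\theta(X_k) \cdots f_\theta(X_n) \big)} \ge a \Big\}$$

$$\phantom{T'(a)} = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \inf_{\theta_0 \le \theta \le \theta_1} \Big[ (\lambda - \theta) \sum_{i=k}^{n} (X_i - \phi(\theta)) \Big] \ge a \Big\},$$

where $\phi(\theta)$ is defined in (2.5). Unfortunately, for all $a > 0$, $T'(a)$ is equivalent
to Page's CUSUM procedure $T_{CM}(\theta_1, a)$, and thus it will not be asymptotically
efficient at every $\theta$. To see this, first note that $T'(a) \ge T_{CM}(\theta_1, a)$ by their
definitions. Next, if $T_{CM}(\theta_1, a)$ stops at time $n_0$, then for some $1 \le k_0 \le n_0$,
$\sum_{i=k_0}^{n_0} (X_i - \phi(\theta_1)) \ge a/(\lambda - \theta_1)$ since $\lambda > \theta_1$. Thus, if $a > 0$, then for all
$\theta_0 \le \theta \le \theta_1 (< \lambda)$,

$$\sum_{i=k_0}^{n_0} (X_i - \phi(\theta)) \ge \sum_{i=k_0}^{n_0} (X_i - \phi(\theta_1)) \ge \frac{a}{\lambda - \theta_1} \ge \frac{a}{\lambda - \theta}$$

because $\phi(\theta)$ is an increasing function of $\theta$. This implies that $T'(a)$ stops before or
at time $n_0$ and so $T'(a) \le T_{CM}(\theta_1, a)$. Therefore, $T'(a) = T_{CM}(\theta_1, a)$. Similarly,
if one considers $T''(a) = \sup_{\theta_0 \le \theta \le \theta_1} \{T_{CM}(\theta, a)\}$, then $T''(a)$ is also equivalent to
$T_{CM}(\theta_1, a)$, because for all $a > 0$, Page's CUSUM procedure $T_{CM}(\theta, a)$ is increasing
as a function of $\theta \in [\theta_0, \theta_1]$ in the sense that $T_{CM}(\theta, a) \le T_{CM}(\theta', a)$ if $\theta \le \theta'$.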
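The equivalence $T'(a) = T_{CM}(\theta_1, a)$ can be observed on simulated paths. The sketch below assumes unit-variance normal densities for illustration. In that case the log-likelihood ratio over a window is a convex quadratic in $\theta$, so its infimum over $[\theta_0, \theta_1]$ is attained at the window mean clamped to the interval. The function names are hypothetical.

```python
import random


def cusum_theta1(xs, a, theta1, lam):
    """Page's CUSUM for f_theta1 versus f_lam (unit-variance normals), threshold a."""
    w = 0.0
    for n, x in enumerate(xs, start=1):
        w = max(w, 0.0) + (lam - theta1) * x - (lam ** 2 - theta1 ** 2) / 2.0
        if w >= a:
            return n
    return None


def glr_stopping_time(xs, a, theta0, theta1, lam):
    """T'(a): stop when some window k..n has inf over theta of the log-LR >= a."""
    for n in range(1, len(xs) + 1):
        s = 0.0
        for k in range(n, 0, -1):
            s += xs[k - 1]
            m = n - k + 1
            # The window log-LR is convex in theta and minimised at the window
            # mean, clamped to [theta0, theta1].
            t = min(max(s / m, theta0), theta1)
            if (lam - t) * s - m * (lam ** 2 - t ** 2) / 2.0 >= a:
                return n
    return None


def simulate_path(seed, theta, lam, change_at, length):
    """Path with mean theta for the first change_at observations, then mean lam."""
    rng = random.Random(seed)
    return [rng.gauss(theta if i < change_at else lam, 1.0) for i in range(length)]


if __name__ == "__main__":
    path = simulate_path(0, -0.75, 0.0, 50, 300)
    print(glr_stopping_time(path, 3.0, -1.0, -0.5, 0.0),
          cusum_theta1(path, 3.0, -0.5, 0.0))
```

Agreement on every simulated path reflects the argument in the text: whenever the CUSUM window for $\theta_1$ crosses the boundary, the window mean exceeds $\theta_1$, so the clamped estimate is $\theta_1$ itself.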


2.3. Extension to half-open interval. Suppose $X_1, X_2, \ldots$ are independent
and identically distributed random variables with probability density $f_\xi$ of the
form (2.1) and suppose we are interested in testing the null hypothesis

$$H_0 : \xi \in \Theta = (\underline{\xi}, \theta_1]$$

against the alternative hypothesis

$$H_1 : \xi \in \Lambda = \{\lambda\},$$

where $\theta_1 < \lambda$. Recall that $\Omega = (\underline{\xi}, \bar{\xi})$ is the natural parameter space of $\xi$. Assume

$$\lim_{\theta \to \underline{\xi}} E_\theta X = -\infty. \tag{2.16}$$

This condition is equivalent to $\lim_{\theta \to \underline{\xi}} b'(\theta) = -\infty$ since $b'(\theta) = E_\theta X$. Many
distributions satisfy this condition. For example, (2.16) holds for the normal
distributions since $E_\theta X = \theta$ and $\underline{\xi} = -\infty$. It also holds for the negative exponential
density since $b(\theta) = -\log \theta$, $\underline{\xi} = 0$ and $E_\theta X = b'(\theta) = -1/\theta$.

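As in (2.4), our proposed open-ended test $M(a)$ of $H_0 : \xi \in \Theta = (\underline{\xi}, \theta_1]$ against
$H_1 : \xi = \lambda$ is defined by

$$M(a) = \inf\Big\{ n \ge 1 : \sum_{i=1}^{n} \log \frac{f_\lambda(X_i)}{f_\theta(X_i)} \ge I(\lambda, \theta)a \ \text{ for all } \underline{\xi} < \theta \le \theta_1 \Big\}.$$

As in (2.6), $M(a)$ can be written as

$$M(a) = \inf\Big\{ n \ge 1 : \sum_{i=1}^{n} X_i \ge b'(\lambda)a + \sup_{\underline{\xi} < \theta \le \theta_1} \big[(n - a)\phi(\theta)\big] \Big\}, \tag{2.17}$$

where $\phi(\theta)$ is defined in (2.5). By L'Hôpital's rule and the condition in (2.16),

$$\lim_{\theta \to \underline{\xi}} \phi(\theta) = \lim_{\theta \to \underline{\xi}} \frac{b(\lambda) - b(\theta)}{\lambda - \theta} = \lim_{\theta \to \underline{\xi}} b'(\theta) = \lim_{\theta \to \underline{\xi}} E_\theta X = -\infty.$$

Thus for any $n < a$, $\sum_{i=1}^{n} X_i$ is finite but $\sup_{\underline{\xi} < \theta \le \theta_1} [(n - a)\phi(\theta)] = \infty$. So $M(a)$
will never stop at time $n < a$. Recall that $\phi(\theta)$ is an increasing function of $\theta$, hence
the supremum in (2.17) is attained at $\theta = \theta_1$ if $n > a$. Therefore,

$$M(a) = \inf\Big\{ n \ge a : \sum_{i=1}^{n} \log \frac{f_\lambda(X_i)}{f_{\theta_1}(X_i)} \ge I(\lambda, \theta_1)a \Big\}.$$

For the problem of detecting a change in distribution from some $f_\theta$ with
$\theta \in \Theta = (\underline{\xi}, \theta_1]$ to $f_\lambda$, define $M^*(a)$ from $M(a)$ as before, so that

$$M^*(a) = \inf\Big\{ n \ge a : \max_{1 \le k \le n-a+1} \sum_{i=k}^{n} \log \frac{f_\lambda(X_i)}{f_{\theta_1}(X_i)} \ge I(\lambda, \theta_1)a \Big\}.$$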

Using arguments similar to the proof of Theorem 2.2, we have:

THEOREM 2.3. For $a > 0$ and $\theta \in (\underline{\xi}, \theta_1]$,

$$E_\theta M^*(a) \ge \exp(I(\lambda, \theta)a),$$

and as $a \to \infty$,

$$\bar{E}_\lambda M^*(a) \le a + (C + o(1))\sqrt{a},$$


where 





Thus the analogue of Corollary 2.2 holds, and so $M^*(a)$ asymptotically maximizes
$\log E_\theta N$ [up to $O(\sqrt{a})$] for every $\theta \in (\underline{\xi}, \theta_1]$ among all stopping times $N$
such that $\bar{E}_\lambda N \le \bar{E}_\lambda M^*(a)$.


2.4. Numerical examples. In this subsection we describe the results of a 
Monte Carlo experiment designed to check the insights obtained from the asymp- 
totic theory of previous subsections. The simulations consider the problem of
detecting a change in a normal mean, where the pre-change distribution $f_\theta = N(\theta, 1)$
with $\theta \in \Theta = [-1, -0.5]$, and the post-change distribution $f_\lambda = N(\lambda, 1)$ with
$\lambda \in \Lambda = \{0\}$.

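Table 1 compares our procedure $M^*(a)$ and two versions of Page's CUSUM
procedure $T_{CM}(\theta_0, a)$ over a range of $\theta$ values. Here

$$M^*(a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \inf_{-1 \le \theta \le -0.5} \Big[ \sum_{i=k}^{n} \Big( -\theta X_i + \frac{\theta^2}{2} \Big) - \frac{\theta^2}{2} a \Big] \ge 0 \Big\}$$

and

$$T_{CM}(\theta_0, a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \sum_{i=k}^{n} \Big( -\theta_0 X_i + \frac{\theta_0^2}{2} \Big) \ge a \Big\}.$$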


The threshold value $a$ for Page's CUSUM procedure $T_{CM}(\theta_0, a)$ and our
procedure $M^*(a)$ was determined from the criterion $\bar{E}_\lambda N \approx 20$. First, a $10^4$-repetition
Monte Carlo simulation was performed to determine the appropriate values of $a$
to yield the desired detection delay to within the range of sampling error. With the
thresholds used, the detection delay $\bar{E}_\lambda N$ is close enough to 20 so that the
difference is negligible, that is, correcting the threshold to get exactly 20 (if we knew
how to do that) would change $E_\theta N$ by an amount that would make little difference
in light of the simulation errors $E_\theta N$ already has. Next, using the obtained
threshold value $a$, we ran 1000 repetitions to simulate long ARL, $E_\theta N$, for different $\theta$.

TABLE 1
Long ARL for different procedures

  θ      Best possible      M*(a)             T_CM(-0.5, a)     T_CM(-1.0, a)
                            (a = 18.50)       (a = 2.92)        (a = 9.88)
 -0.5    233 ± 7            206 ± 6           233 ± 7           125 ± 3
 -0.6    523 ± 15           501 ± 15          518 ± 15          297 ± 8
 -0.7    1384 ± 43          1324 ± 43         1227 ± 37         938 ± 29
 -0.8    5157 ± 165         4688 ± 148        3580 ± 113        4148 ± 129
 -0.9    22,942 ± 699       19,217 ± 606      10,613 ± 343      21,617 ± 658
 -1.0    118,223 ± 3711     83,619 ± 2566     31,641 ± 1036     118,223 ± 3711

(The best possible values are obtained from an optimal envelope of Page's CUSUM procedures.)

Table 1 also reports the best possible $E_\theta N$ at each of the values of $\theta$ subject
to $\bar{E}_\lambda N \approx 20$. Note that they are obtained from an optimal envelope of Page's
CUSUM procedures and therefore cannot be attained simultaneously in practice.
Each result in Table 1 is recorded as the Monte Carlo estimate ± standard error.

Table 1 shows that $M^*(a)$ performs well over a broad range of $\theta$, which is
consistent with the asymptotic theory of $M^*(a)$ developed in Sections 2.2 and 2.3
showing that $M^*(a)$ attains [up to $O(\sqrt{a})$] the asymptotic upper bounds for
$\log E_\theta N$ in Corollary 2.2 as $a \to \infty$.
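The kind of Monte Carlo estimate reported in Table 1 can be sketched as follows for Page's CUSUM procedure. This is not the code behind the table; the function names, the seed, the number of repetitions and the truncation point `max_n` are all assumptions made here for illustration.

```python
import random


def cusum_run_length(rng, theta_true, theta0, lam, a, max_n=10 ** 6):
    """Run length of T_CM(theta0, a) when observations are N(theta_true, 1).

    The run is truncated at max_n, an illustrative safeguard against
    extremely long paths.
    """
    w = 0.0
    for n in range(1, max_n + 1):
        x = rng.gauss(theta_true, 1.0)
        w = max(w, 0.0) + (lam - theta0) * x - (lam ** 2 - theta0 ** 2) / 2.0
        if w >= a:
            return n
    return max_n


def estimate_arl(theta_true, theta0, lam, a, reps, seed=0):
    """Monte Carlo estimate of the mean run length and its standard error."""
    rng = random.Random(seed)
    runs = [cusum_run_length(rng, theta_true, theta0, lam, a) for _ in range(reps)]
    mean = sum(runs) / reps
    var = sum((r - mean) ** 2 for r in runs) / (reps - 1)
    return mean, (var / reps) ** 0.5


if __name__ == "__main__":
    # Long ARL at theta = -0.5 and detection delay at lambda = 0 for T_CM(-0.5, 2.92).
    print(estimate_arl(-0.5, -0.5, 0.0, 2.92, reps=400, seed=1))
    print(estimate_arl(0.0, -0.5, 0.0, 2.92, reps=400, seed=1))
```

For CUSUM started at $W_0 = 0$, the delay simulated with the change at the first observation coincides with the worst-case delay, which is why the second call targets the criterion $\bar{E}_\lambda N \approx 20$. With a few hundred repetitions the estimates are only expected to fall in the neighbourhood of the corresponding Table 1 entries.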


3. Composite post-change hypotheses. Let $\Theta$ and $\Lambda$ be two compact
disjoint subsets of some Euclidean space. Let $\{f_\theta; \theta \in \Theta\}$ and $\{g_\lambda; \lambda \in \Lambda\}$ be two
sets of densities, absolutely continuous with respect to the same nondegenerate
$\sigma$-finite measure. In this section we are interested in detecting a change in
distribution from $f_\theta$ for some $\theta \in \Theta$ to $g_\lambda$ for some $\lambda \in \Lambda$. Here we no longer assume
the densities belong to exponential families, and we assume that both $\Theta$ and $\Lambda$ are
composite.

Ideally we would like a stopping time $N$ which minimizes the detection delay
$\bar{E}_\lambda N$ for all $\lambda \in \Lambda$ and maximizes $E_\theta N$ for all $\theta \in \Theta$, that is, we seek a family
$\{N(a)\}$ which is asymptotically efficient for all $(\theta, \lambda) \in \Theta \times \Lambda$. However, in
general such a family does not exist. For example, for $\Lambda = \{\lambda_1, \lambda_2\}$ it is easy to
see from (1.3) that there exists a family that is asymptotically efficient at both
$(\theta, \lambda_1)$ and $(\theta, \lambda_2)$ for all $\theta \in \Theta$ only if $I(\lambda_2, \theta)/I(\lambda_1, \theta)$ is constant in $\theta \in \Theta$.
This fails in general when $\Theta$ is composite. For example, if $f_\theta$ and $g_\lambda$ belong to a
one-parameter exponential family and $\Theta$ is an interval, a simple argument shows
that $I(\lambda_2, \theta)/I(\lambda_1, \theta)$ is a constant if and only if $\lambda_1 = \lambda_2$.
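To illustrate the last claim in one concrete case (an assumption made here for illustration: the unit-variance normal family, for which $b(\xi) = \xi^2/2$), the information numbers are $I(\lambda, \theta) = (\lambda - \theta)^2/2$, so that

$$\frac{I(\lambda_2, \theta)}{I(\lambda_1, \theta)} = \Big( \frac{\lambda_2 - \theta}{\lambda_1 - \theta} \Big)^2, \qquad \frac{d}{d\theta}\, \frac{\lambda_2 - \theta}{\lambda_1 - \theta} = \frac{\lambda_2 - \lambda_1}{(\lambda_1 - \theta)^2}.$$

When $\lambda_1 \ne \lambda_2$ the derivative never vanishes, and since neither $\lambda_1$ nor $\lambda_2$ lies in the interval $\Theta$, the ratio $(\lambda_2 - \theta)/(\lambda_1 - \theta)$ keeps a constant sign there. It is therefore strictly monotone with constant sign on $\Theta$, and its square cannot be constant in $\theta$.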

It is natural to consider the following definition: 


DEFINITION 3.1. A family of stopping times $\{N(a)\}$ is asymptotically
optimal to first order if:

(i) for each $\theta \in \Theta$, there exists at least one $\lambda_\theta \in \Lambda$ such that the family is
asymptotically efficient at $(\theta, \lambda_\theta)$; and

(ii) for each $\lambda \in \Lambda$, there exists at least one $\theta_\lambda \in \Theta$ such that the family is
asymptotically efficient at $(\theta_\lambda, \lambda)$.

REMARK. An equivalent definition is to require that the family $\{N(a)\}$ is
asymptotically efficient at $(h_1(\delta), h_2(\delta))$ for $\delta \in \Delta$, where $\theta = h_1(\delta)$ and $\lambda = h_2(\delta)$
are onto (not necessarily one-to-one) functions from $\Delta$ to $\Theta$ and $\Lambda$, respectively.
It is obvious that the standard formulation with simple $\Theta$ and our formulation in
Section 2 are two special cases of this definition.


REMARK. It is worth noting that a family of stopping times that is
asymptotically optimal to first order is asymptotically admissible in the following sense.
A family of stopping times $\{N(a)\}$ is asymptotically inadmissible if there exists
another family of stopping times $\{N'(a)\}$ such that for all $\theta \in \Theta$ and all $\lambda \in \Lambda$,

$$\limsup_{a \to \infty} \frac{\log E_\theta N(a)}{\log E_\theta N'(a)} \le 1 \quad \text{and} \quad \liminf_{a \to \infty} \frac{\bar{E}_\lambda N(a)}{\bar{E}_\lambda N'(a)} \ge 1,$$

with strict inequality holding for some $\theta$ or $\lambda$. A family of stopping times is
asymptotically admissible if it is not asymptotically inadmissible.


Note that when $\Lambda = \{\lambda\}$ is simple, the asymptotically optimal procedure
developed in Section 2 satisfies

$$\log E_\theta N(a) \sim I(\lambda, \theta)a \quad \text{as } a \to \infty. \tag{3.1}$$

Here and everywhere below, $x(a) \sim y(a)$ as $a \to \infty$ means that
$\lim_{a \to \infty} (x(a)/y(a)) = 1$. However, when one considers multiple values of the
post-change parameter $\lambda$ it is no longer possible to find a procedure such that (3.1) holds for all
$(\theta, \lambda) \in \Theta \times \Lambda$. A natural idea is then to seek procedures such that

$$\log E_\theta N(a) \sim p(\theta)a,$$

where $p(\theta)$ is suitably chosen. It turns out that for "good" choices of $p(\theta)$ one can
define $\{N(a)\}$ to be asymptotically optimal to first order.
To accomplish this, first consider the following definitions. 


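DEFINITION 3.2. A positive continuous function $p(\cdot)$ on $\Theta$ is an optimizer if
for some positive continuous $q(\cdot)$ on $\Lambda$

$$p(\theta) = \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q(\lambda)};$$

similarly, $q(\cdot)$ on $\Lambda$ is an optimizer if for some positive continuous $p(\cdot)$ on $\Theta$

$$q(\lambda) = \inf_{\theta \in \Theta} \frac{I(\lambda, \theta)}{p(\theta)}.$$

DEFINITION 3.3. Positive continuous functions $p(\cdot)$, $q(\cdot)$ on $\Theta$, $\Lambda$,
respectively, are an optimizer pair if for all $\theta$ and $\lambda$

$$p(\theta) = \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q(\lambda)} \quad \text{and} \quad q(\lambda) = \inf_{\theta \in \Theta} \frac{I(\lambda, \theta)}{p(\theta)}. \tag{3.2}$$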








The following proposition characterizes the relation between these two defini- 
tions. 


PROPOSITION 3.1. If $(p, q)$ is an optimizer pair, then $p$ and $q$ are optimizers.
Conversely, for every optimizer $p$, there is a $q$ such that $(p, q)$ is an optimizer pair,
namely,

$$q(\lambda) = \inf_{\theta \in \Theta} \frac{I(\lambda, \theta)}{p(\theta)},$$

and, similarly, for every optimizer $q$ one can obtain an optimizer pair $(p, q)$ by
defining

$$p(\theta) = \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q(\lambda)}.$$




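PROOF. It is obvious that $p$ and $q$ are optimizers if $(p, q)$ is an optimizer pair.
Since everything is symmetric in the roles of $p$ and $q$, we only need to prove that
the first equation of (3.2) holds for the case where $q$ is defined after $p$. Now fix
$\theta_0 \in \Theta$. On the one hand, since $q(\lambda)$ is defined as the infimum over $\Theta$, we have
$q(\lambda) \le I(\lambda, \theta_0)/p(\theta_0)$, so $p(\theta_0) \le I(\lambda, \theta_0)/q(\lambda)$ for all $\lambda \in \Lambda$. Thus

$$p(\theta_0) \le \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta_0)}{q(\lambda)}. \tag{3.3}$$

On the other hand, since $p$ is an optimizer by assumption, there exists a
function $q_0(\cdot)$ on $\Lambda$ such that

$$p(\theta) = \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q_0(\lambda)}.$$

For any $\lambda_0 \in \Lambda$, we have $p(\theta) \le I(\lambda_0, \theta)/q_0(\lambda_0)$ and so $I(\lambda_0, \theta)/p(\theta) \ge q_0(\lambda_0)$
for all $\theta \in \Theta$. Hence

$$\inf_{\theta \in \Theta} \frac{I(\lambda_0, \theta)}{p(\theta)} \ge q_0(\lambda_0).$$

Observe that the left-hand side is just our definition for $q(\lambda_0)$, and so
$q(\lambda_0) \ge q_0(\lambda_0)$. Since $\lambda_0$ is arbitrary, we have $q(\lambda) \ge q_0(\lambda)$ for all $\lambda \in \Lambda$. Thus,

$$\inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q(\lambda)} \le \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q_0(\lambda)} = p(\theta)$$

by using the definition of $p(\theta)$. The first equation of (3.2) follows at once from
this and (3.3). □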


In fact, Proposition 3.1 provides a method to construct optimizer pairs. One can
start with any positive continuous function $q_0(\lambda)$, get an optimizer $p(\theta)$ from it
by (3.2) and use the other part of (3.2) to get a $(p, q)$ optimizer pair. Similarly, one
can also get a $(p, q)$ optimizer pair by starting with a $p_0(\theta)$.
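The construction just described can be sketched numerically by replacing $\Theta$ and $\Lambda$ with finite grids. The grids, the starting function $q_0 \equiv 1$ and the normal information numbers $I(\lambda, \theta) = (\lambda - \theta)^2/2$ are assumptions made here for illustration, and the function names are hypothetical.

```python
def optimizer_pair(thetas, lams, info, q0):
    """Build (p, q) on finite grids from a positive starting function q0.

    p(theta) = min over lam of info(lam, theta) / q0(lam), then
    q(lam)   = min over theta of info(lam, theta) / p(theta).
    """
    p = [min(info(lam, th) / q0(lam) for lam in lams) for th in thetas]
    q = [min(info(lam, th) / p_th for th, p_th in zip(thetas, p)) for lam in lams]
    return p, q


def normal_info(lam, theta):
    """I(lam, theta) for unit-variance normals (illustrative choice)."""
    return (lam - theta) ** 2 / 2.0


if __name__ == "__main__":
    thetas = [-1.0 + 0.05 * j for j in range(11)]  # grid on [-1, -0.5]
    lams = [0.5 + 0.1 * j for j in range(11)]      # grid on [0.5, 1.5]
    p, q = optimizer_pair(thetas, lams, normal_info, lambda lam: 1.0)
    # Check the first equation of (3.2): p should be recovered from q.
    gap = max(abs(p_th - min(normal_info(l, th) / q_l for l, q_l in zip(lams, q)))
              for th, p_th in zip(thetas, p))
    print(gap)
```

The argument in the proof of Proposition 3.1 does not depend on the sets being intervals, so the fixed-point relation (3.2) is expected to hold on the grids up to floating-point error.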


Now we can define our proposed procedures based on an optimizer $p(\theta)$. First,
let $\pi$ be an a priori distribution fully supported on $\Lambda$. Define an open-ended
test $T(a)$ by

$$T(a) = \inf\Big\{ n \ge 1 : \inf_{\theta \in \Theta} \Big[ \frac{1}{p(\theta)} \log \frac{\int_\Lambda [g_\lambda(X_1) \cdots g_\lambda(X_n)]\, \pi(d\lambda)}{f_\theta(X_1) \cdots f_\theta(X_n)} \Big] \ge a \Big\}. \tag{3.4}$$

Then our proposed procedure is defined by $T^*(a) = \min_{k \ge 1} (T_k(a) + k - 1)$, where
$T_k(a)$ is obtained by applying $T(a)$ to $X_k, X_{k+1}, \ldots$. Equivalently,

$$T^*(a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \inf_{\theta \in \Theta} \Big[ \frac{1}{p(\theta)} \log \frac{\int_\Lambda [g_\lambda(X_k) \cdots g_\lambda(X_n)]\, \pi(d\lambda)}{f_\theta(X_k) \cdots f_\theta(X_n)} \Big] \ge a \Big\}. \tag{3.5}$$

We also define a slightly different procedure $T_1^*(a)$ by switching $\inf_{\theta \in \Theta}$ with
$\max_{1 \le k \le n}$ in the definition of $T^*(a)$.

Our main results in this section are stated in the next theorem and its
corollary, which establish the asymptotic optimality properties of $T^*(a)$ and $T_1^*(a)$.
The proofs are given in Section 3.1.
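A brute-force sketch of $T^*(a)$ in (3.5) follows, under assumptions made here for illustration: $\Theta$ and $\Lambda$ are replaced by finite grids, $\pi$ is a vector of strictly positive weights on the $\lambda$ grid, and all densities are unit-variance normal. The function names are hypothetical, and the mixture is evaluated on the log scale for numerical stability.

```python
import math


def log_normal_pdf(x, mu):
    """Log density of N(mu, 1) at x."""
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def t_star(xs, a, thetas, p, lams, prior):
    """T*(a) of (3.5) on finite grids; p[i] is the optimizer value at thetas[i].

    prior must have strictly positive weights (a fully supported pi).
    """
    for n in range(1, len(xs) + 1):
        log_g = [math.log(w) for w in prior]  # log prior + log-likelihood per lam
        log_f = [0.0] * len(thetas)           # log-likelihood per theta
        for k in range(n, 0, -1):             # extend the window backwards to x_k
            x = xs[k - 1]
            log_g = [lg + log_normal_pdf(x, lam) for lg, lam in zip(log_g, lams)]
            log_f = [lf + log_normal_pdf(x, th) for lf, th in zip(log_f, thetas)]
            mixture = logsumexp(log_g)
            stat = min((mixture - lf) / p_th for lf, p_th in zip(log_f, p))
            if stat >= a:
                return n
    return None


if __name__ == "__main__":
    # With single-point grids and p = 1 the rule reduces to Page's CUSUM (1.1).
    print(t_star([-1.0, -1.0, 0.5, 1.0, 2.0], 2.0, [-1.0], [1.0], [0.0], [1.0]))
```

The cost of this direct evaluation grows quadratically in $n$, so it is suitable only for checking small examples. The single-point case in the demonstration serves as a sanity check against (1.1).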


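THEOREM 3.1. Assume that Assumptions A1 and A2 below hold and
$\Theta$ and $\Lambda$ are compact. If $p(\theta)$ is an optimizer, then $\{T^*(a)\}$ and $\{T_1^*(a)\}$ are
asymptotically optimal to first order.

COROLLARY 3.1. Under the assumptions of Theorem 3.1, if $\{N(a)\}$ is a
family of procedures such that

$$\limsup_{a \to \infty} \frac{\bar{E}_\lambda N(a)}{\bar{E}_\lambda T^*(a)} \le 1 \quad \text{for all } \lambda \in \Lambda,$$

then

$$\limsup_{a \to \infty} \frac{\log E_\theta N(a)}{\log E_\theta T^*(a)} \le 1 \quad \text{for all } \theta \in \Theta.$$

Similarly, if

$$\liminf_{a \to \infty} \frac{\log E_\theta N(a)}{\log E_\theta T^*(a)} \ge 1 \quad \text{for all } \theta \in \Theta,$$

then

$$\liminf_{a \to \infty} \frac{\bar{E}_\lambda N(a)}{\bar{E}_\lambda T^*(a)} \ge 1 \quad \text{for all } \lambda \in \Lambda.$$

The same assertions are true if $T^*(a)$ is replaced by $T_1^*(a)$.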


REMARK. Corollary 3.1 shows that our procedures $T^*(a)$ and $T_1^*(a)$ are also
asymptotically optimal in the following sense: If a family of procedures $\{N(a)\}$
performs asymptotically as well as our procedures (or better) uniformly over $\Theta$,
then our procedures perform asymptotically as well as $\{N(a)\}$ (or better) uniformly
over $\Lambda$, and the same is true if the roles of $\Theta$ and $\Lambda$ are reversed.

REMARK. Theorem 3.1 and Corollary 3.1 show another asymptotic
optimality property of our procedures $T^*(a)$ and $T_1^*(a)$: If the optimizer $p(\theta)$ is
constructed from $q_0(\lambda)$ by the first equation of (3.2), then our procedures
asymptotically maximize $E_\theta N$ for every $\theta \in \Theta$ among all stopping times $N$ satisfying

$$q_0(\lambda)\bar{E}_\lambda N \le \gamma \quad \text{for all } \lambda \in \Lambda,$$

where $\gamma > 0$ is given. Here $q_0(\lambda) > 0$ can be thought of as the cost per observation
of delay if the post-change observations have distribution $g_\lambda$.


REMARK. Instead of $T(a)$ in (3.4), we can also define the following stopping
time in open-ended hypothesis testing problems:

$$\hat{T}(a) = \inf\Big\{ n \ge 1 : \inf_{\theta \in \Theta} \Big[ \frac{1}{p(\theta)} \log \frac{\sup_{\lambda \in \Lambda} g_\lambda(X_1) \cdots g_\lambda(X_n)}{f_\theta(X_1) \cdots f_\theta(X_n)} \Big] \ge a \Big\}, \tag{3.6}$$

and then use it to construct the corresponding procedures in change-point
problems. When $f_\theta$ and $g_\lambda$ are from the same one-parameter exponential family, we
can obtain an upper bound on $P_\theta(\hat{T}(a) < \infty)$ by equation (13) on page 636 in [15],
and so we get a lower bound on the long ARL. The upper bound on detection
delay follows from the fact that $\hat{T}(a) \le T(a)$. These procedures are, therefore, also
asymptotically optimal to first order if $f_\theta$ and $g_\lambda$ belong to one-parameter
exponential families.


REMARK. Note that if $p(\theta) \equiv 1$, then all of our procedures are just based on
generalized likelihood ratios. However, in the case where $p(\theta) \equiv 1$ is not an
optimizer, generalized likelihood ratio procedures may not be asymptotically optimal
to first order. In fact, they are asymptotically inadmissible since they are dominated
by our procedures based on an optimizer $p(\theta)$ which is obtained by starting with
$p_0(\theta) \equiv 1$.


Throughout this section we impose the following assumptions on the densities
$f_\theta$ and $g_\lambda$.


ASSUMPTION A1. The Kullback-Leibler information numbers $I(\lambda, \theta) =
E_\lambda \log(g_\lambda(X)/f_\theta(X))$ are finite. Furthermore:

(a) $I_0 = \inf_\lambda \inf_\theta I(\lambda, \theta) > 0$,
(b) $I(\lambda, \theta)$ and $I(\lambda) = \inf_\theta I(\lambda, \theta)$ are both continuous in $\lambda$.




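ASSUMPTION A2. For all $\theta$, $\lambda$:

(a) $E_\lambda[\log(g_\lambda(X)/f_\theta(X))]^2 < \infty$,
(b) $\lim_{\rho \to 0} E_\lambda[\log \sup_{|\theta' - \theta| \le \rho} f_{\theta'}(X) - \log f_\theta(X)]^2 = 0$,
(c) $\lim_{\lambda' \to \lambda} E_\lambda[\log g_{\lambda'}(X) - \log g_\lambda(X)]^2 = 0$.


Assumptions A1 and A2 are part of Assumptions 2 and 3 in [7].
Assumption A1(a) guarantees that $\Theta$ and $\Lambda$ are "separated."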


3.1. Proof of main results. First we establish the lower bound on the long
ARLs of our procedures $T^*(a)$ and $T_1^*(a)$ for any arbitrary positive function $p(\theta)$.


LEMMA 3.1. For all $a > 0$ and $\theta \in \Theta$,

$$\log E_\theta T^*(a) \ge \log E_\theta T_1^*(a) \ge p(\theta) a.$$


PROOF. Define

$$t(\theta, a) = \inf\Big\{ n \ge 1 : \frac{1}{p(\theta)} \log \frac{\int_\Lambda [g_\lambda(X_1) \cdots g_\lambda(X_n)]\, \eta(d\lambda)}{f_\theta(X_1) \cdots f_\theta(X_n)} \ge a \Big\}$$

and $t^*(\theta, a) = \min_{k \ge 1} (t_k(\theta, a) + k - 1)$, where $t_k(\theta, a)$ is obtained by applying
$t(\theta, a)$ to $X_k, X_{k+1}, \ldots$. Then it is clear that $T^*(a) \ge T_1^*(a) \ge t^*(\theta, a)$, and
hence

$$E_\theta T^*(a) \ge E_\theta T_1^*(a) \ge E_\theta[t^*(\theta, a)].$$

Using Lemma 2.1 and Wald's likelihood ratio identity, we have

$$E_\theta[t^*(\theta, a)] \ge \frac{1}{P_\theta(t(\theta, a) < \infty)} \ge \exp(p(\theta) a),$$

which proves the lemma. □


Next we derive an upper bound on the detection delays of our procedures
$T^*(a)$ and $T_1^*(a)$.


LEMMA 3.2. Suppose that Assumptions A1 and A2 hold and $\Theta$ is compact.
If $p(\theta)$ is a positive continuous function (not necessarily an optimizer) on $\Theta$, then
for all $\lambda \in \Lambda$,

$$\bar{E}_\lambda T_1^*(a) \le \bar{E}_\lambda T^*(a) \le (1 + o(1)) \frac{a}{q(\lambda)}$$

as $a \to \infty$, where $q(\lambda)$ is defined by

$$q(\lambda) = \inf_{\theta \in \Theta} \frac{I(\lambda, \theta)}{p(\theta)}.$$







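PROOF. By definition, $\bar{E}_\lambda T_1^*(a) \le \bar{E}_\lambda T^*(a) \le E_\lambda T(a)$, where $T(a)$ is defined
in (3.4), so it suffices to show that

$$E_\lambda T(a) \le (1 + o(1)) \frac{a}{q(\lambda)}$$

for any $\lambda \in \Lambda$. We will use the method in [7] to prove this inequality. Fix $\lambda_0 \in \Lambda$
and choose an arbitrary $\varepsilon > 0$. By Assumptions A1 and A2, the compactness of $\Theta$
and the continuity of $p(\theta)$, there exist a finite covering $\{U_i, 1 \le i \le k_\varepsilon\}$ of $\Theta$ (with
$\theta_i \in U_i$) and a positive number $\delta_\varepsilon$ such that for all $\lambda \in V_\varepsilon = \{\lambda : |\lambda - \lambda_0| < \delta_\varepsilon\}$ and
$i = 1, \ldots, k_\varepsilon$,

$$E_{\lambda_0}\Big[\log g_\lambda(X) - \log \sup_{\theta \in U_i} f_\theta(X)\Big] \ge I(\lambda_0, \theta_i) - \varepsilon \tag{3.7}$$

and

$$\sup_{\theta \in U_i} |p(\theta) - p(\theta_i)| < \varepsilon.$$

Let $N_1(a)$ be the smallest $n$ such that

$$\log \int_{V_\varepsilon} [g_\lambda(X_1) \cdots g_\lambda(X_n)]\, \eta(d\lambda) \ge \sup_{\theta \in \Theta} \Big[ p(\theta) a + \sum_{j=1}^n \log f_\theta(X_j) \Big]. \tag{3.8}$$

Clearly $N_1(a) \ge T(a)$. By Jensen's inequality, the left-hand side of (3.8) is greater
than or equal to

$$\int_{V_\varepsilon} \log[g_\lambda(X_1) \cdots g_\lambda(X_n)] \frac{\eta(d\lambda)}{\eta(V_\varepsilon)} + \log \eta(V_\varepsilon)
= \sum_{j=1}^n \int_{V_\varepsilon} \log g_\lambda(X_j) \frac{\eta(d\lambda)}{\eta(V_\varepsilon)} - |\log \eta(V_\varepsilon)| \tag{3.9}$$

since $\eta(V_\varepsilon) \le 1$. Since $\{U_i\}$ covers $\Theta$, the right-hand side of (3.8) is less than or
equal to

$$\max_{1 \le i \le k_\varepsilon} \sup_{\theta \in U_i} \Big[ p(\theta) a + \sum_{j=1}^n \log f_\theta(X_j) \Big]
\le \max_{1 \le i \le k_\varepsilon} \Big[ (p(\theta_i) + \varepsilon) a + \sup_{\theta \in U_i} \sum_{j=1}^n \log f_\theta(X_j) \Big]
\le \max_{1 \le i \le k_\varepsilon} \Big[ (p(\theta_i) + \varepsilon) a + \sum_{j=1}^n \log \sup_{\theta \in U_i} f_\theta(X_j) \Big]. \tag{3.10}$$

For $j = 1, 2, \ldots$, put

$$Y_j = \int_{V_\varepsilon} \log g_\lambda(X_j) \frac{\eta(d\lambda)}{\eta(V_\varepsilon)} \quad \text{and} \quad Z_j^i = \log \sup_{\theta \in U_i} f_\theta(X_j) \quad \text{for } i = 1, \ldots, k_\varepsilon.$$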




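Let $N_2(a)$ be the smallest $n$ such that

$$\sum_{j=1}^n Y_j - \max_{1 \le i \le k_\varepsilon} \Big[ \sum_{j=1}^n Z_j^i + (p(\theta_i) + \varepsilon) a \Big] \ge |\log \eta(V_\varepsilon)|$$

or, equivalently, the smallest $n$ such that for all $1 \le i \le k_\varepsilon$,

$$\sum_{j=1}^n \frac{Y_j - Z_j^i}{p(\theta_i)} \ge a \Big( 1 + \frac{\varepsilon}{p(\theta_i)} \Big) + \frac{|\log \eta(V_\varepsilon)|}{p(\theta_i)}.$$

Using (3.9) and (3.10), it is clear that $N_2(a) \ge N_1(a)$. Let $p_0 = \inf_{\theta \in \Theta} p(\theta)$; then
$p_0 > 0$ since $p(\theta)$ is a positive continuous function and $\Theta$ is compact. Define
$\tau_\varepsilon = |\log \eta(V_\varepsilon)| / p_0$, and let $N_3(a)$ be the smallest $n$ such that

$$\min_{1 \le i \le k_\varepsilon} \sum_{j=1}^n \frac{Y_j - Z_j^i}{p(\theta_i)} \ge a \Big( 1 + \frac{\varepsilon}{p_0} \Big) + \tau_\varepsilon$$

or, equivalently,

$$\sum_{j=1}^n \Big[ \frac{Y_j - Z_j^1}{p(\theta_1)} - \varepsilon \Big] + \min_{1 \le i \le k_\varepsilon} \sum_{j=1}^n \Big[ \frac{Y_j - Z_j^i}{p(\theta_i)} - \frac{Y_j - Z_j^1}{p(\theta_1)} + \varepsilon \Big] \ge a \Big( 1 + \frac{\varepsilon}{p_0} \Big) + \tau_\varepsilon.$$

Clearly $N_3(a) \ge N_2(a)$. From (3.7) we have

$$E_{\lambda_0} \Big[ \frac{Y_j - Z_j^i}{p(\theta_i)} - \varepsilon \Big] \ge \frac{I(\lambda_0, \theta_i)}{p(\theta_i)} - \varepsilon \Big( 1 + \frac{1}{p_0} \Big) \quad \text{for } i = 1, \ldots, k_\varepsilon. \tag{3.11}$$

For $n = 1, 2, \ldots$ define

$$S_n = \sum_{j=1}^n \Big[ \frac{Y_j - Z_j^1}{p(\theta_1)} - \varepsilon \Big]$$

and

$$B_n^i = \sum_{j=1}^n \Big[ \frac{Y_j - Z_j^i}{p(\theta_i)} - \frac{Y_j - Z_j^1}{p(\theta_1)} + \varepsilon \Big] \quad \text{for } i = 1, \ldots, k_\varepsilon.$$

Let $N^*(a)$ be the smallest $n$ such that, simultaneously,

$$S_n \ge a \Big( 1 + \frac{\varepsilon}{p_0} \Big) + \tau_\varepsilon \quad \text{and} \quad \min_{1 \le i \le k_\varepsilon} B_n^i \ge 0.$$

Clearly, $N^*(a) \ge N_3(a)$. Now it suffices to show that

$$E_{\lambda_0} N^*(a) \le (1 + r_\varepsilon) \frac{a}{q(\lambda_0)} \tag{3.12}$$

for all sufficiently large $a$ for some $r_\varepsilon > 0$ which can be made arbitrarily small by
choosing a sufficiently small $\varepsilon$.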




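To prove (3.12), assume that $\{U_i\}$ are indexed (re-index if necessary) so that the
minimum (over $i$) of the left-hand side of (3.11) occurs when $i = 1$. By the proof
of Lemma 2 in [7], we have

$$E_{\lambda_0} N^*(a) \le E_{\lambda_0}(\nu_1) + E_{\lambda_0}(\nu_+) E_{\lambda_0}(w), \tag{3.13}$$

where

$$\nu_1 = \inf\Big\{ n : S_n \ge a \Big( 1 + \frac{\varepsilon}{p_0} \Big) + \tau_\varepsilon \Big\},$$
$$\nu_+ = \inf\{ n : S_n > 0 \},$$
$$w = \text{last time } \min_{1 \le i \le k_\varepsilon} B_n^i < 0.$$

By (3.11) and the definition of $q(\lambda)$,

$$E_{\lambda_0} \Big[ \frac{Y_j - Z_j^1}{p(\theta_1)} - \varepsilon \Big] \ge q(\lambda_0) - \varepsilon \Big( 1 + \frac{1}{p_0} \Big).$$

Thus, if we choose $\varepsilon$ small enough so that $q(\lambda_0) - \varepsilon(1 + 1/p_0) > 0$, then it is well
known from renewal theory that

$$E_{\lambda_0}(\nu_1) \le (1 + o(1)) \frac{a(1 + \varepsilon/p_0) + \tau_\varepsilon}{q(\lambda_0) - \varepsilon(1 + 1/p_0)} \quad \text{and} \quad E_{\lambda_0}(\nu_+) = D(\varepsilon) < \infty.$$

Moreover, $E_{\lambda_0}(w) = h(\varepsilon) < \infty$ because the summands in $B_n^i$ have positive mean
and finite variance under $P_{\lambda_0}$; see, for example, Theorem D in [7]. Relation (3.12)
follows at once from (3.13). Therefore, the lemma holds. □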


PROOF OF THEOREM 3.1 AND COROLLARY 3.1. First we establish an upper
bound of $\log E_\theta T^*(a)$. By Lemma 3.2 and Lorden's lower bound (1.2),

$$\log E_\theta T^*(a) \le \inf_{\lambda \in \Lambda} \big( (1 + o(1)) I(\lambda, \theta) \bar{E}_\lambda T^*(a) \big) \le \inf_{\lambda \in \Lambda} \Big( (1 + o(1)) I(\lambda, \theta) \frac{a}{q(\lambda)} \Big).$$

The compactness of $\Lambda$ leads to

$$\log E_\theta T^*(a) \le (1 + o(1)) \Big( \inf_{\lambda \in \Lambda} \frac{I(\lambda, \theta)}{q(\lambda)} \Big) a.$$

If $p(\theta)$ is an optimizer, then $(p(\theta), q(\lambda))$ is an optimizer pair by Proposition 3.1.
Thus

$$\log E_\theta T^*(a) \le (1 + o(1)) p(\theta) a.$$

Combining this with Lemma 3.1 yields

$$\log E_\theta T^*(a) \sim p(\theta) a.$$

Similarly,

$$\bar{E}_\lambda T^*(a) \sim a / q(\lambda),$$

and the same results are true if $T^*(a)$ is replaced by $T_1^*(a)$.
To prove Theorem 3.1, note that the asymptotic efficiency of $T^*(a)$ and $T_1^*(a)$
at $(\theta, \lambda)$ is

$$e(\theta, \lambda) = \frac{p(\theta) q(\lambda)}{I(\lambda, \theta)},$$

and so they are asymptotically optimal to first order by virtue of the compactness
of $\Theta$ and $\Lambda$ and the definition of an optimizer pair.

Applying Lorden's lower bound, we can prove Corollary 3.1 in the same way
as the upper bound for $\log E_\theta T^*(a)$. □


3.2. Optimizer pairs. The following are some examples of an optimizer
pair $(p, q)$ and the corresponding asymptotically optimal procedures.


EXAMPLE 3.1. If there exists $I_0$ such that for all $\theta \in \Theta$, $\inf_{\lambda \in \Lambda} I(\lambda, \theta) = I_0$,
then $q_0(\lambda) = I_0$ yields

$$p(\theta) = 1 \quad \text{and} \quad q(\lambda) = \inf_{\theta \in \Theta} I(\lambda, \theta).$$

This is even true for composite $\Theta$ and $\Lambda$. In particular, if $\Theta$ is simple, say $\{\theta_0\}$,
then our consideration reduces to the standard formulation where the pre-change
distribution is completely specified. Moreover, Pollak [18] proved that $T(a)$,
defined in (3.4), has a second-order optimality property in the context of open-ended
hypothesis testing if $f_\theta$ and $g_\lambda$ belong to exponential families.


EXAMPLE 3.2. If there exists $I_0$ such that for all $\lambda \in \Lambda$, $\inf_{\theta \in \Theta} I(\lambda, \theta) = I_0$,
then $q_0(\lambda) = 1$ yields

$$p(\theta) = \inf_{\lambda \in \Lambda} I(\lambda, \theta) \quad \text{and} \quad q(\lambda) = 1,$$

even for composite $\Theta$ and $\Lambda$. In particular, if $\Lambda$ is simple, say $\{\lambda\}$, then the
considerations of Section 3 reduce to those of the problem in Section 2.


EXAMPLE 3.3. Suppose $f_\theta$ and $g_\lambda$ are exponentially distributed with unknown
means $1/\theta$ and $1/\lambda$, respectively. Assume $\Theta = \{\theta : \theta \in [\theta_0, \theta_1]\}$ and
$\Lambda = \{\lambda : \lambda \in [\lambda_0, \lambda_1]\}$, where $\theta_0 < \theta_1 < \lambda_0 < \lambda_1$. Then optimizer pairs $(p(\theta), q(\lambda))$ are
not unique. For example, the following two pairs are nonequivalent:

$$p_1(\theta) = I(\lambda_0, \theta), \qquad q_1(\lambda) = I(\lambda, \theta_0) / I(\lambda_0, \theta_0),$$
$$p_2(\theta) = I(\lambda_1, \theta) I(\lambda_0, \theta_1) / I(\lambda_1, \theta_1), \qquad q_2(\lambda) = I(\lambda, \theta_1) / I(\lambda_0, \theta_1).$$

Suppose $t_1^*(a)$ and $t_2^*(a)$ are the procedures defined by (3.5) for the pairs
$(p_1(\theta), q_1(\lambda))$ and $(p_2(\theta), q_2(\lambda))$, respectively. Even though both $t_1^*(a)$ and $t_2^*(a)$
are asymptotically optimal to first order, $t_1^*(a)$ performs better uniformly over $\Theta$
(in the sense of larger long ARL), while $t_2^*(a)$ performs better uniformly over $\Lambda$
(in the sense of smaller short ARL).
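
The two pairs in Example 3.3 can be compared numerically. The following is only a sketch: it assumes the exponential Kullback-Leibler number $I(\lambda, \theta) = \theta/\lambda - 1 - \log(\theta/\lambda)$ and borrows the end points 0.8, 1, 2, 3 from the simulation study; the function names and the grid are illustrative and not part of the paper.

```python
import math

# Illustrative end points (assumption): Theta = [0.8, 1], Lambda = [2, 3].
TH0, TH1, LA0, LA1 = 0.8, 1.0, 2.0, 3.0

def kl_exp(lam, theta):
    """I(lam, theta) for exponential densities with rates lam (post) and theta (pre)."""
    r = theta / lam
    return r - 1.0 - math.log(r)

def p1(theta):
    return kl_exp(LA0, theta)

def q1(lam):
    return kl_exp(lam, TH0) / kl_exp(LA0, TH0)

def p2(theta):
    return kl_exp(LA1, theta) * kl_exp(LA0, TH1) / kl_exp(LA1, TH1)

def q2(lam):
    return kl_exp(lam, TH1) / kl_exp(LA0, TH1)

def grid(lo, hi, m=21):
    """Evenly spaced grid of m points on [lo, hi]."""
    return [lo + (hi - lo) * j / (m - 1) for j in range(m)]

def max_efficiency(p, q):
    """Largest value of p(theta) q(lam) / I(lam, theta) over the grid."""
    return max(p(t) * q(l) / kl_exp(l, t)
               for t in grid(TH0, TH1) for l in grid(LA0, LA1))
```

On this grid neither pair exceeds efficiency 1, $p_1 \ge p_2$ on $\Theta$ and $q_2 \ge q_1$ on $\Lambda$, which is the trade-off described in the example.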




3.3. Numerical simulations. In this section we report some simulation studies 
comparing the performance of our procedures with a commonly used procedure in 
the literature. 

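The simulations consider the problem of detecting a change in distribution
from $f_\theta$ to $g_\lambda$, where $f_\theta$ and $g_\lambda$ are exponentially distributed with unknown means
$1/\theta$ and $1/\lambda$, respectively, and $\theta \in \Theta = [0.8, 1]$ and $\lambda \in \Lambda = [2, 3]$.

Note that $q_0(\lambda) = 1$ leads to an optimizer $p(\theta) = I(2, \theta)$ where
$I(\lambda, \theta) = \theta/\lambda - 1 - \log(\theta/\lambda)$, and so our procedure based on (3.6) is defined by

$$T^*(a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \inf_{0.8 \le \theta \le 1} \sup_{2 \le \lambda \le 3} \frac{\lambda - \theta}{p(\theta)} \sum_{i=k}^n \Big( \frac{\log \lambda - \log \theta}{\lambda - \theta} - X_i \Big) \ge a \Big\}.$$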
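
A direct, unoptimized evaluation of this stopping rule might look as follows. It is a sketch only: the infimum over $\theta$ and the supremum over $\lambda$ are approximated on finite grids, the threshold comparison is taken as "at least $a$", and every name (`THETAS`, `LAMBDAS`, `window_stat`, `t_star`) is hypothetical.

```python
import math

# Grid approximations (assumption) of Theta = [0.8, 1] and Lambda = [2, 3].
THETAS = [0.8 + 0.02 * j for j in range(11)]
LAMBDAS = [2.0 + 0.1 * j for j in range(11)]

def p_opt(theta):
    """Optimizer p(theta) = I(2, theta) = theta/2 - 1 - log(theta/2)."""
    return theta / 2.0 - 1.0 - math.log(theta / 2.0)

def window_stat(k, q):
    """inf over theta of sup over lambda of the weighted log-likelihood
    ratio for a window of k observations whose sum is q."""
    best = math.inf
    for th in THETAS:
        sup_lam = max(k * math.log(lam / th) - (lam - th) * q for lam in LAMBDAS)
        best = min(best, sup_lam / p_opt(th))
    return best

def t_star(xs, a):
    """First n (1-based) at which some window X_{n-k+1..n} has statistic >= a;
    returns None if the rule never fires on xs."""
    for n in range(1, len(xs) + 1):
        q = 0.0
        for k in range(1, n + 1):
            q += xs[n - k]          # extend the window one step into the past
            if window_stat(k, q) >= a:
                return n
    return None
```

The cost is quadratic in the sample size, which is why the recursive screening described later in this section is useful in practice.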


A commonly used procedure in the change-point literature is the generalized
likelihood ratio procedure which specifies the nominal value $\theta_0$ (of the parameter
of the pre-change distribution); see [14] and [29]. The procedure is defined by the
stopping time

$$\tau(\theta_0, a) = \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \sup_{2 \le \lambda \le 3} \sum_{i=k}^n \log \frac{g_\lambda(X_i)}{f_{\theta_0}(X_i)} \ge a \Big\}
= \inf\Big\{ n \ge 1 : \max_{1 \le k \le n} \sup_{2 \le \lambda \le 3} \sum_{i=k}^n \Big( \log \frac{\lambda}{\theta_0} - (\lambda - \theta_0) X_i \Big) \ge a \Big\}.$$

Note that $\tau(\theta_0, a)$ can be thought of as our procedure $T^*(a)$ whose $\Theta$ contains
the single point $\theta_0$. The choice of $\theta_0$ can be made directly by considering the
pre-change distribution which is closest to the post-change distributions because it is
always more difficult to detect a smaller change. For our example, $\theta_0 = 1$.

An effective method to implement $\tau(\theta_0, a)$ numerically can be found in [14].
Similarly, we can implement $T^*(a)$ as follows. Compute $V_n$ recursively by
$V_n = \max(V_{n-1} + \log(2/0.8) - (2 - 0.8) X_n, 0)$. Whenever $V_n = 0$, one can begin a
new cycle, discarding all previous observations and starting afresh on the
incoming observations, because for all $0.8 \le \theta \le 1$, $2 \le \lambda \le 3$ and $1 \le k \le n$,
$\sum_{i=k}^n ((\log \lambda - \log \theta)/(\lambda - \theta) - X_i) \le 0$ since $(\log \lambda - \log \theta)/(\lambda - \theta)$ is
maximized at $(\theta, \lambda) = (0.8, 2)$. Now each time a new cycle begins compute at each
stage $n = 1, 2, \ldots$

$$Q_k^{(n)} = X_n + \cdots + X_{n-k+1}, \qquad k = 1, \ldots, n.$$

Then the procedure $T^*(a)$ = first $n$ such that $Q_k^{(n)} \le c_k$ for some $k$, where

$$c_k = \inf_{0.8 \le \theta \le 1} \sup_{2 \le \lambda \le 3} \Big[ k\, \frac{\log \lambda - \log \theta}{\lambda - \theta} - \frac{p(\theta) a}{\lambda - \theta} \Big].$$


To further speed up the implementation, compute $W_n$ recursively by
$W_n = \max(W_{n-1} + \log 2 - X_n, 0)$. Stop whenever $W_n \ge p(0.8)a/1.2$. Continue taking
new observations (i.e., do not stop) whenever $W_n < p(1)a/2$. If $p(1)a/2 \le W_n <
p(0.8)a/1.2$, then we will also stop at time $n$ if $Q_k^{(n)} \le c_k$ for some $k$. The reasons
behind this implementation are given below.


TABLE 2
Comparison of two procedures in change-point problems with composite
pre-change and composite post-change hypotheses

                                    T*(a)            τ(1, a)
a                                   22.50            5.02
E_θ N          θ = 1                601 ± 18         606 ± 19
               θ = 0.9              1448 ± 43        1207 ± 36
               θ = 0.8              3772 ± 116       2749 ± 90
Ē_λ N          λ = 2                21.41 ± 0.10     21.92 ± 0.11
               λ = 2.25             18.09 ± 0.07     18.18 ± 0.09
               λ = 2.5              15.08 ± 0.05     14.76 ± 0.06
               λ = 2.75             13.75 ± 0.04     13.22 ± 0.05
               λ = 3                12.29 ± 0.04     11.62 ± 0.04

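First, if at time $n_0$ we have $W_{n_0} \ge p(0.8)a/1.2 > 0$, then there exists some $k_0$
such that $\sum_{i=k_0}^{n_0} (\log 2 - X_i) \ge p(0.8)a/1.2$. Thus for all $\theta \in [0.8, 1]$ and $\lambda_0 = 2$,

$$\frac{\lambda_0 - \theta}{p(\theta)} \sum_{i=k_0}^{n_0} \Big( \frac{\log \lambda_0 - \log \theta}{\lambda_0 - \theta} - X_i \Big)
\ge \frac{2 - \theta}{p(\theta)} \sum_{i=k_0}^{n_0} (\log 2 - X_i)
\ge \frac{2 - \theta}{p(\theta)} \cdot \frac{p(0.8) a}{1.2} \ge a.$$

Hence, $T^*(a)$ will stop at time $n_0$. Second, $T^*(a)$ will never stop at time $n$ when
$W_n < p(1)a/2$ because for $\theta_1 = 1$, all $2 \le \lambda \le 3$, and all $k$,

$$\frac{\lambda - \theta_1}{p(\theta_1)} \sum_{i=k}^{n} \Big( \frac{\log \lambda - \log \theta_1}{\lambda - \theta_1} - X_i \Big)
\le \frac{\lambda - \theta_1}{p(\theta_1)} \sum_{i=k}^{n} (\log 2 - X_i)
\le \frac{\lambda - \theta_1}{p(\theta_1)} W_n \le \frac{2}{p(1)} W_n < a.$$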
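
The screening step based on $W_n$ can be sketched as below. This is an illustration under assumptions, not the author's code: the labels "stop", "continue" and "check" are invented here, boundary cases are resolved as "stop when $W_n$ is at least the upper threshold, continue when $W_n$ is below the lower threshold", and "check" means that the exact comparison of $Q_k^{(n)}$ with $c_k$ is still required at that time.

```python
import math

def p_opt(theta):
    """Optimizer p(theta) = I(2, theta) for the exponential example."""
    return theta / 2.0 - 1.0 - math.log(theta / 2.0)

def screen(xs, a):
    """Classify each time n by the CUSUM-type recursion
    W_n = max(W_{n-1} + log 2 - X_n, 0)."""
    hi = p_opt(0.8) * a / 1.2   # at or above: T*(a) must stop
    lo = p_opt(1.0) * a / 2.0   # below: T*(a) cannot stop
    w, decisions = 0.0, []
    for x in xs:
        w = max(w + math.log(2.0) - x, 0.0)
        if w >= hi:
            decisions.append("stop")
        elif w < lo:
            decisions.append("continue")
        else:
            decisions.append("check")
    return decisions
```

With $a = 22.5$ the two thresholds are roughly 5.93 and 2.17, so only the times at which $W_n$ falls between them need the more expensive window search.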


Table 2 provides a comparison of the performances for our procedure $T^*(a)$
with those of $\tau(\theta_0, a)$. The threshold $a$ for each of these two procedures is
determined from the criterion $E_{\theta = \theta_0} N(a) \approx 600$. The results in Table 2 are based on
1000 simulations for $E_\theta N$ and 10,000 simulations for $\bar{E}_\lambda N$. Note that for these
two procedures, the detection delay $\bar{E}_\lambda N = E_\lambda N$. Table 2 shows that at a small
additional cost of detection delay, $T^*(a)$ can significantly improve the mean times
between false alarms compared to $\tau(1, a)$. This is consistent with the asymptotic
theory in this section.





4. Normal distributions. Our general theory in Section 3 assumes that
$\Theta$ and $\Lambda$ are compact. If they are not compact, then our proposed procedures may
or may not be asymptotically optimal. However, we can still sometimes apply our
ideas in these situations, as shown in the following example.

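Suppose we want to detect a change from negative to positive in the mean of
independent normally distributed random variables with variance 1. In the context
of open-ended hypothesis testing, we want to test

$$H_0 : \theta \in \Theta = (-\infty, 0) \quad \text{against} \quad H_1 : \lambda \in \Lambda = (0, \infty).$$

Let us examine the procedures $\hat{T}(a)$ defined in (3.6) for different choices of
optimizer pairs.

First, let us assume $q_0(\lambda) = \lambda^{1/\beta}$ with $\beta \ge 1/2$; then we have an optimizer pair

$$p(\theta) = k_\beta |\theta|^{2 - (1/\beta)} \quad \text{and} \quad q(\lambda) = \lambda^{1/\beta} \quad \text{with } k_\beta = 2\beta^2 (2\beta - 1)^{(1/\beta) - 2}$$

(assume $0^0 = 1$), and thus the procedure defined in (3.6) becomes $\hat{t}_\beta(a)$ = first
time $n$ such that

$$\inf_{\theta < 0} \sup_{\lambda > 0} \frac{1}{p(\theta)} \Big[ (\lambda - \theta) S_n - \frac{n}{2} (\lambda^2 - \theta^2) \Big] \ge a, \quad \text{where } S_n = \sum_{i=1}^n X_i.$$

Letting $\theta \to 0$ gives us that $S_n > 0$ if $\hat{t}_\beta(a) = n$, and rewriting the stopping rule as

$$\inf_{\theta < 0} \sup_{\lambda > 0} \Big[ -\frac{n}{2} \Big( \frac{S_n}{n} - \lambda \Big)^2 + \frac{n}{2} \Big( \frac{S_n}{n} - \theta \Big)^2 - p(\theta) a \Big] \ge 0.$$

The supremum is attained at $\lambda = S_n / n$, and so $\hat{t}_\beta(a)$ = first time $n$ such that for all
$\theta < 0$,

$$\frac{S_n}{n} \ge \theta + \sqrt{\frac{2}{n} p(\theta) a}.$$

A routine calculation leads to

$$\hat{t}_\beta(a) = \inf\{ n \ge 1 : S_n \ge a^\beta n^{1 - \beta} \}.$$

This suggests using a stopping time of the form

$$t_\beta^*(a) = \inf\Big\{ n \ge 1 : \max_{0 \le k < n} (S_n - S_k)(n - k)^{\beta - 1} \ge a^\beta \Big\} \tag{4.1}$$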
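
The routine calculation behind the boundary $S_n \ge a^\beta n^{1-\beta}$ can be sketched as follows for $\beta > 1/2$, taking $p(\theta) = k_\beta |\theta|^{2 - 1/\beta}$ with $k_\beta = 2\beta^2 (2\beta - 1)^{1/\beta - 2}$ as read from the text above. The auxiliary symbols $t$, $c$ and $g$ are introduced here only for illustration.

```latex
\begin{align*}
&\text{Put } t = -\theta > 0 \text{ and } c = \sqrt{2 k_\beta a / n}. \text{ The requirement }
  \frac{S_n}{n} \ge \theta + \sqrt{\tfrac{2}{n}\, p(\theta)\, a} \text{ for all } \theta < 0 \text{ reads} \\
&\qquad \frac{S_n}{n} \ge \sup_{t > 0} g(t), \qquad g(t) = -t + c\, t^{1 - 1/(2\beta)} . \\
&g \text{ is concave, and } g'(t) = 0 \iff t^{1/(2\beta)} = c\, \frac{2\beta - 1}{2\beta},
  \text{ where } g(t) = \frac{t}{2\beta - 1}, \text{ so} \\
&\qquad \sup_{t > 0} g(t) = \frac{1}{2\beta - 1} \Big[ c\, \frac{2\beta - 1}{2\beta} \Big]^{2\beta} . \\
&\text{Since } c^{2\beta} = (2 k_\beta)^{\beta} (a/n)^{\beta}
  = (2\beta)^{2\beta} (2\beta - 1)^{1 - 2\beta} (a/n)^{\beta}, \\
&\qquad \sup_{t > 0} g(t) = (a/n)^{\beta}, \qquad \text{that is,} \qquad
  S_n \ge n\, (a/n)^{\beta} = a^{\beta} n^{1 - \beta} .
\end{align*}
```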


to detect a change in mean from negative to positive. Observe that for $\beta = 1$, $\hat{t}_\beta(a)$
is just the one-sided SPRT and $t_\beta^*(a)$ is just a special form of Page's CUSUM
procedures. For $\beta = 1/2$, $\hat{t}_\beta(a)$ and $t_\beta^*(a)$ have also been studied extensively in
the literature, since they are based on the generalized likelihood ratio. Different
motivation to obtain these two procedures can be found for $\hat{t}_\beta(a)$ in Chapter IV
of [28], which is from the viewpoint of the repeated significance test, and for $t_\beta^*(a)$
in [29], which is from the viewpoint of the generalized likelihood ratio. For $\hat{t}_\beta(a)$
with $0 < \beta < 1$, see [3] and equation (9.2) on page 188 in [28].
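
A stopping time of the form (4.1) is easy to evaluate on a finite sample. The sketch below assumes the reading "maximum over $0 \le k < n$ of $(S_n - S_k)(n - k)^{\beta - 1}$ at least $a^\beta$" with $S_0 = 0$; the function name and the quadratic-time search are illustrative choices, not part of the paper.

```python
def t_star_beta(xs, a, beta):
    """First n (1-based) with max over 0 <= k < n of
    (S_n - S_k) * (n - k)**(beta - 1) >= a**beta, or None if never reached."""
    prefix = [0.0]                      # prefix[k] = S_k, with S_0 = 0
    for x in xs:
        prefix.append(prefix[-1] + x)
    threshold = a ** beta
    for n in range(1, len(xs) + 1):
        for k in range(n):
            if (prefix[n] - prefix[k]) * (n - k) ** (beta - 1.0) >= threshold:
                return n
    return None
```

For $\beta = 1$ the weight $(n - k)^{\beta - 1}$ equals 1 and the rule reduces to a CUSUM-type comparison of $S_n - \min_{k<n} S_k$ with $a$; for $\beta = 1/2$ it compares standardized window sums with $\sqrt{a}$.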




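Next, $q_0(\lambda) \equiv 1$ leads to

$$p(\theta) = \frac{\theta^2}{2} \quad \text{and} \quad q(\lambda) = 1$$

and

$$\hat{t}_\infty(a) = \inf\{ n \ge a : S_n \ge 0 \}.$$

Hence we use the following stopping time to detect a change in mean from negative
to positive:

$$t_\infty^*(a) = \inf\Big\{ n \ge a : \max_{0 \le k \le n - a} (S_n - S_k) \ge 0 \Big\}, \tag{4.2}$$

where the maximum is taken over $0 \le k \le n - a$. It is interesting to see that
$\hat{t}_\infty(a)$ and $t_\infty^*(a)$ can be thought of as the limits of $\hat{t}_\beta(a)$ and $t_\beta^*(a)$, respectively,
as $\beta \to \infty$.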

Though one cannot use our theorems directly to analyze the properties of $t_\beta^*(a)$
and $t_\infty^*(a)$, they are indeed asymptotically optimal to first order. For $\beta > 1/2$, first
note that

$$p(\theta) q(\lambda) = I(\lambda, \theta) \quad \text{if } \theta = -(2\beta - 1)\lambda.$$

By nonlinear renewal theory ([28], Chapters 9 and 10),

$$E_\lambda \hat{t}_\beta(a) \sim a / q(\lambda).$$

Equation (13) on page 636 in [15] shows that for any $\theta < 0$,

$$P_\theta(\hat{t}_\beta(a) < \infty) \le \exp(-(1 + o(1)) p(\theta) a),$$

and so Lemma 2.1 implies $\log E_\theta t_\beta^*(a) \sim p(\theta) a$ as $a \to \infty$. Thus $t_\beta^*(a)$ is
asymptotically efficient at $(\theta, \lambda)$ with $\theta = -(2\beta - 1)\lambda$, and hence $t_\beta^*(a)$ is asymptotically
optimal to first order. Similarly, the asymptotic optimality property of $t_\infty^*(a)$ can be
proved directly since the structure of $\hat{t}_\infty(a)$ is very simple.
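
The identity $p(\theta) q(\lambda) = I(\lambda, \theta)$ on the line $\theta = -(2\beta - 1)\lambda$ can be checked numerically. The sketch below assumes unit-variance normal observations, so that $I(\lambda, \theta) = (\lambda - \theta)^2/2$, together with $p(\theta) = k_\beta |\theta|^{2 - 1/\beta}$, $q(\lambda) = \lambda^{1/\beta}$ and $k_\beta = 2\beta^2 (2\beta - 1)^{1/\beta - 2}$; the function names are illustrative.

```python
def k_beta(beta):
    """Constant k_beta = 2 beta^2 (2 beta - 1)^(1/beta - 2), for beta > 1/2."""
    return 2.0 * beta ** 2 * (2.0 * beta - 1.0) ** (1.0 / beta - 2.0)

def p(theta, beta):
    return k_beta(beta) * abs(theta) ** (2.0 - 1.0 / beta)

def q(lam, beta):
    return lam ** (1.0 / beta)

def info(lam, theta):
    """Kullback-Leibler number between N(lam, 1) and N(theta, 1)."""
    return 0.5 * (lam - theta) ** 2

def efficiency(theta, lam, beta):
    """e(theta, lam) = p(theta) q(lam) / I(lam, theta); at most 1 for an optimizer pair."""
    return p(theta, beta) * q(lam, beta) / info(lam, theta)
```

For example, with $\beta = 1$ the pair is $p(\theta) = 2|\theta|$, $q(\lambda) = \lambda$, and the efficiency $4|\theta|\lambda/(\lambda + |\theta|)^2$ equals 1 exactly when $\theta = -\lambda$.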


REMARK. The above arguments establish the following optimality properties
of $\hat{t}_\beta(a)$ and $t_\beta^*(a)$. Suppose we want to test

$$H_{0,\delta} : \theta = -(2\beta - 1)\delta \quad \text{against} \quad H_{1,\delta} : \lambda = \delta,$$

where $\beta > 1/2$ is given but $\delta > 0$ is unknown. Then $\hat{t}_\beta(a)$ is an asymptotically
optimal solution for all $\delta > 0$, while $t_\beta^*(a)$ is asymptotically optimal in the
problems of detecting a change from $H_{0,\delta}$ to $H_{1,\delta}$ for all $\delta > 0$. As far as we know, no
optimality properties of $\hat{t}_\beta(a)$ and $t_\beta^*(a)$ have been studied except for the special
case of $\beta = 1/2$ or 1. Even for the case $\beta = 1/2$ which was studied in [29], our
method is simpler and more instructive.




5. Proof of Theorem 2.1. The basic idea in proving Theorem 2.1 is to relate
the stopping time $M(a)$ in (2.4) to new stopping times defined by

$$M_\theta(a) = \inf\Big\{ n \ge 1 : \sum_{i=1}^n \log \frac{f_\lambda(X_i)}{f_\theta(X_i)} - I(\lambda, \theta) a \ge 0 \Big\}. \tag{5.1}$$

The proof of Theorem 2.1 is based on the following lemmas.


LEMMA 5.1. For all $\theta \in [\theta_0, \theta_1]$,

$$P_\theta(M(a) < \infty) \le P_\theta(M_\theta(a) < \infty) \le \exp(-I(\lambda, \theta) a),$$

and hence (2.7) holds.

PROOF. The first inequality follows at once from the fact that $M(a) \ge M_\theta(a)$
for all $\theta \in [\theta_0, \theta_1]$, and the second inequality is a direct application of Wald's
likelihood ratio identity. □


We now derive approximations for $E_\lambda M(a)$. Similarly to (2.6), $M_\theta(a)$ in (5.1)
can be written as

$$M_\theta(a) = \inf\{ n \ge 1 : S_n \ge b'(\lambda) a + (n - a) \phi(\theta) \}.$$

As we said earlier, the supremum in (2.6) is attained at $\theta = \theta_0$ if $n \le a$, and at
$\theta = \theta_1$ if $n > a$, so that

$$\{ M(a) = m \} = \{ M_{\theta_0}(a) = m \} \quad \text{for all } m \le a. \tag{5.2}$$

For simplicity, we omit $a$ and $\theta$, writing $M = M(a)$ and $M_k = M_{\theta_k}(a)$ for $k = 0, 1$.


LEMMA 5.2. As $a \to \infty$,

$$E_\lambda(M) = a + \frac{\phi(\theta_1) - \phi(\theta_0)}{b'(\lambda) - \phi(\theta_1)} E_\lambda(a - M_0; M_0 \le a) + O(1).$$


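PROOF. Observe that

$$E_\lambda M = a - E_\lambda(a - M; M \le a) + E_\lambda(M - a; M > a),$$

and by (5.2), $E_\lambda(M - a; M \le a) = E_\lambda(M_0 - a; M_0 \le a)$. Thus it suffices to show
that

$$E_\lambda(M - a; M > a) = \frac{b'(\lambda) - \phi(\theta_0)}{b'(\lambda) - \phi(\theta_1)} E_\lambda(a - M_0; M_0 \le a) + O(1). \tag{5.3}$$

To prove this, define a stopping time

$$N_k(u) = \inf\Big\{ n \ge 1 : \sum_{i=1}^n (X_i - \phi(\theta_k)) \ge u \Big\},$$

for $k = 0, 1$ and any $u > 0$. Assume $a$ is an integer. (For general $a$, using $[a]$,
the largest integer $\le a$, permits one to carry through the following argument with
minor modifications.) By (5.2) we have

$$E_\lambda(M - a \mid M > a) = \int_{-\infty}^0 E_\lambda(M - a \mid S_a - b'(\lambda) a = x, M_0 > a)\, P_\lambda(S_a - b'(\lambda) a \in dx \mid M_0 > a).$$

Conditional on the event $\{ S_a - b'(\lambda) a = x, M_0 > a \}$,

$$M - a = \inf\{ m : X_{a+1} + \cdots + X_{a+m} + S_a \ge b'(\lambda) a + m \phi(\theta_1) \}
= \inf\Big\{ m : \sum_{i=1}^m (X_{a+i} - \phi(\theta_1)) \ge b'(\lambda) a - S_a = -x \Big\},$$

which is equivalent to $N_1(-x)$ since $X_1, X_2, \ldots$ are independent and identically
distributed. Thus

$$E_\lambda(M - a \mid M > a) = \int_{-\infty}^0 E_\lambda N_1(-x)\, P_\lambda(S_a - b'(\lambda) a \in dx \mid M_0 > a). \tag{5.4}$$

Similarly,

$$E_\lambda(M_0 - a \mid M_0 > a) = \int_{-\infty}^0 E_\lambda N_0(-x)\, P_\lambda(S_a - b'(\lambda) a \in dx \mid M_0 > a). \tag{5.5}$$

Now for $k = 0, 1$ and any $u > 0$, define

$$R_k(u) = \sum_{j=1}^{N_k(u)} (X_j - \phi(\theta_k)) - u.$$

Then, by Theorem 1 in [13],

$$\sup_{u \ge 0} E_\lambda R_k(u) \le E_\lambda (X_1 - \phi(\theta_k))^2 / (b'(\lambda) - \phi(\theta_k)) < \infty.$$

By Wald's equation, $(b'(\lambda) - \phi(\theta_k)) E_\lambda N_k(u) = u + E_\lambda R_k(u)$, so that

$$\sup_{u \ge 0} \Big| E_\lambda N_k(u) - \frac{u}{b'(\lambda) - \phi(\theta_k)} \Big| < \infty$$

for $k = 0, 1$. Hence, we have

$$\sup_{u \ge 0} \Big| E_\lambda N_1(u) - \frac{b'(\lambda) - \phi(\theta_0)}{b'(\lambda) - \phi(\theta_1)} E_\lambda N_0(u) \Big| < \infty.$$

Plugging into (5.4), and comparing with (5.5), we have

$$E_\lambda(M - a \mid M > a) = \frac{b'(\lambda) - \phi(\theta_0)}{b'(\lambda) - \phi(\theta_1)} E_\lambda(M_0 - a \mid M_0 > a) + O(1).$$

Relation (5.3) follows at once from the fact that $\{ M > a \} = \{ M_0 > a \}$ and the fact
that $E_\lambda(M_0 - a) = O(1)$. Hence, the lemma holds. □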


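LEMMA 5.3. Suppose $Y_1, Y_2, \ldots$ are independent and identically distributed
with mean $\mu > 0$ and finite variance $\sigma^2$. Define

$$N_a = \inf\Big\{ n \ge 1 : \sum_{i=1}^n Y_i \ge a \Big\}.$$

Then as $a \to \infty$,

$$E\Big( \frac{a}{\mu} - N_a; N_a \le \frac{a}{\mu} \Big) = \sqrt{a} \Big( \frac{\sigma}{\sqrt{2\pi\mu^3}} + o(1) \Big).$$


PROOF. The lemma follows at once from the well-known facts that as $a \to \infty$,

$$E(N_a) = \frac{a}{\mu} + O(1) \quad \text{and} \quad \mathrm{Var}(N_a) = \frac{a\sigma^2}{\mu^3} (1 + o(1)),$$

and that

$$\frac{N_a - a/\mu}{\sqrt{a\sigma^2/\mu^3}}$$

is asymptotically standard normal. See page 372 in [4], equation (5) in [27] and
Theorem 8.34 in [28]. □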
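
The constant in Lemma 5.3 can be read off from the normal limit. The following heuristic sketch uses only the facts quoted in the proof; the standardized variable $Z_a$ is notation introduced here, and exchanging limit and expectation is justified by the convergence of the variance.

```latex
\begin{align*}
Z_a &= \frac{N_a - a/\mu}{\sqrt{a\sigma^2/\mu^3}} \;\Rightarrow\; Z \sim N(0, 1), \\
E\Big( \frac{a}{\mu} - N_a;\; N_a \le \frac{a}{\mu} \Big)
  &= \sqrt{\frac{a\sigma^2}{\mu^3}}\; E\big[ (-Z_a)^+ \big]
   = \sqrt{\frac{a\sigma^2}{\mu^3}}\; \big( E[Z^+] + o(1) \big) \\
  &= \sqrt{\frac{a\sigma^2}{\mu^3}} \Big( \frac{1}{\sqrt{2\pi}} + o(1) \Big)
   = \sqrt{a}\, \Big( \frac{\sigma}{\sqrt{2\pi\mu^3}} + o(1) \Big).
\end{align*}
```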


PROOF OF THEOREM 2.1. Relation (2.7) is proved in Lemma 5.1. By (5.1),
$M_0 = M_{\theta_0}(a)$ can be written as

$$M_0 = \inf\Big\{ n \ge 1 : \sum_{i=1}^n \frac{1}{I(\lambda, \theta_0)} \log \frac{f_\lambda(X_i)}{f_{\theta_0}(X_i)} \ge a \Big\}.$$

By Lemma 5.3 it is easy to show that

$$E_\lambda(a - M_0; M_0 \le a) = \sqrt{a} \Big( \frac{\sigma_0}{\sqrt{2\pi}} + o(1) \Big),$$

where $\sigma_0 = \sqrt{b''(\lambda)} / (b'(\lambda) - \phi(\theta_0))$. Thus relation (2.8) holds by Lemma 5.2 and
the definition of $\phi(\theta)$ in (2.5). □










Acknowledgments. This work is part of my Ph.D. dissertation at the Califor- 
nia Institute of Technology. I would like to thank my thesis advisor, Professor Gary 
Lorden, for his constant support and encouragement and Professor Moshe Pollak 
for sharing his insightful ideas. Thanks also to Professor Sarah Holte, Professor 
Jon A. Wellner, the Associate Editor and the referees for their helpful remarks. 


REFERENCES 


[1] BASSEVILLE, M. and NIKIFOROV, I. (1993). Detection of Abrupt Changes: Theory and
Application. Prentice-Hall, Englewood Cliffs, NJ. MR1210954
[2] BERGER, R. L. and HSU, J. C. (1996). Bioequivalence trials, intersection-union tests and
equivalence confidence sets (with discussion). Statist. Sci. 11 283–319. MR1445984
[3] CHOW, Y. S., HSIUNG, C. A. and LAI, T. L. (1979). Extended renewal theory and moment
convergence in Anscombe's theorem. Ann. Probab. 7 304–318. MR0525056
[4] FELLER, W. (1971). An Introduction to Probability Theory and Its Applications 2, 2nd ed.
Wiley, New York. MR0270403
[5] GORDON, L. and POLLAK, M. (1995). A robust surveillance scheme for stochastically ordered
alternatives. Ann. Statist. 23 1350–1375. MR1353509
[6] GORDON, L. and POLLAK, M. (1997). Average run length to false alarm for surveillance
schemes designed with partially specified pre-change distribution. Ann. Statist. 25
1284–1310. MR1447752
[7] KIEFER, J. and SACKS, J. (1963). Asymptotically optimum sequential inference and design.
Ann. Math. Statist. 34 705–750. MR0150907
[8] KRIEGER, A. M., POLLAK, M. and YAKIR, B. (2003). Surveillance of a simple linear
regression. J. Amer. Statist. Assoc. 98 456–469. MR1995721
[9] LAI, T. L. (1995). Sequential change-point detection in quality control and dynamical systems
(with discussion). J. Roy. Statist. Soc. Ser. B 57 613–658. MR1354072
[10] LAI, T. L. (1998). Information bounds and quick detection of parameter changes in stochastic
systems. IEEE Trans. Inform. Theory 44 2917–2929. MR1672051
[11] LAI, T. L. (2001). Sequential analysis: Some classical problems and new challenges (with
discussion). Statist. Sinica 11 303–408. MR1844531
[12] LEHMANN, E. L. (1959). Testing Statistical Hypotheses. Wiley, New York. MR0107933
[13] LORDEN, G. (1970). On excess over the boundary. Ann. Math. Statist. 41 520–527.
MR0254981
[14] LORDEN, G. (1971). Procedures for reacting to a change in distribution. Ann. Math. Statist. 42
1897–1908. MR0309251
[15] LORDEN, G. (1973). Open-ended tests for Koopman-Darmois families. Ann. Statist. 1
633–643. MR0426318
[16] MOUSTAKIDES, G. V. (1986). Optimal stopping times for detecting changes in distributions.
Ann. Statist. 14 1379–1387. MR0868306
[17] PAGE, E. S. (1954). Continuous inspection schemes. Biometrika 41 100–115. MR0088850
[18] POLLAK, M. (1978). Optimality and almost optimality of mixture stopping rules. Ann. Statist.
6 910–916. MR0494737
[19] POLLAK, M. (1985). Optimal detection of a change in distribution. Ann. Statist. 13 206–227.
MR0773162
[20] POLLAK, M. (1987). Average run lengths of an optimal method of detecting a change in
distribution. Ann. Statist. 15 749–779. MR0888438
[21] POLLAK, M. and SIEGMUND, D. (1985). A diffusion process and its applications to detecting
a change in the drift of Brownian motion. Biometrika 72 267–280. MR0801768


[22] POLLAK, M. and SIEGMUND, D. (1991). Sequential detection of a change in a normal mean
when the initial value is unknown. Ann. Statist. 19 394–416. MR1091859
[23] RITOV, Y. (1990). Decision theoretic optimality of the CUSUM procedure. Ann. Statist. 18
1464–1469. MR1062720
[24] ROBERTS, S. W. (1966). A comparison of some control chart procedures. Technometrics 8
411–430. MR0196887
[25] SHIRYAYEV, A. N. (1963). On optimum methods in quickest detection problems. Theory
Probab. Appl. 8 22–46.
[26] SHIRYAYEV, A. N. (1978). Optimal Stopping Rules. Springer, New York. MR0468067
[27] SIEGMUND, D. (1969). The variance of one-sided stopping rules. Ann. Math. Statist. 40
1074–1077. MR0243626
[28] SIEGMUND, D. (1985). Sequential Analysis: Tests and Confidence Intervals. Springer,
New York. MR0799155
[29] SIEGMUND, D. and VENKATRAMAN, E. S. (1995). Using the generalized likelihood ratio
statistics for sequential detection of a change-point. Ann. Statist. 23 255–271. MR1331667
[30] STOUMBOS, Z., REYNOLDS, M. R., JR., RYAN, T. P. and WOODALL, W. H. (2000). The state
of statistical process control as we proceed into the 21st century. J. Amer. Statist. Assoc.
95 992–998.
[31] VAN DOBBEN DE BRUYN, C. S. (1968). Cumulative Sum Tests. Hafner, New York.
[32] WALD, A. and WOLFOWITZ, J. (1948). Optimum character of the sequential probability ratio
test. Ann. Math. Statist. 19 326–339. MR0026779
[33] YAKIR, B. (1998). On the average run length to false alarm in surveillance problems which
possess an invariance structure. Ann. Statist. 26 1198–1214. MR1635389
[34] YAKIR, B., KRIEGER, A. M. and POLLAK, M. (1999). Detecting a change in regression: First
order optimality. Ann. Statist. 27 1896–1913. MR1765621


FRED HUTCHINSON CANCER RESEARCH CENTER 
1100 FAIRVIEW AVENUE NORTH, M2-B500 
SEATTLE, WASHINGTON 98109 

USA 

E-MAIL: ymei@fhcrc.org


The Annals of Statistics
2006, Vol. 34, No. 1, 123–145
DOI: 10.1214/009053605000000912
© Institute of Mathematical Statistics, 2006


CONSISTENT ESTIMATION OF THE BASIC NEIGHBORHOOD 
OF MARKOV RANDOM FIELDS 


BY IMRE CSISZÁR¹ AND ZSOLT TALATA²
Hungarian Academy of Sciences 


For Markov random fields on $\mathbb{Z}^d$ with finite state space, we address the
statistical estimation of the basic neighborhood, the smallest region that
determines the conditional distribution at a site on the condition that the values
at all other sites are given. A modification of the Bayesian Information
Criterion, replacing likelihood by pseudo-likelihood, is proved to provide strongly
consistent estimation from observing a realization of the field on increasing
finite regions: the estimated basic neighborhood equals the true one
eventually almost surely, not assuming any prior bound on the size of the latter.
Stationarity of the Markov field is not required, and phase transition does not
affect the results.


1. Introduction. In this paper Markov random fields on the lattice $\mathbb{Z}^d$ with
finite state space are considered, adopting the usual assumption that the
finite-dimensional distributions are strictly positive. Equivalently, these are Gibbs fields
with finite range interaction; see [13]. They are essential in statistical physics, for
modeling interactive particle systems [10], and also in several other fields [3], for
example, in image processing [2].

One statistical problem for Markov random fields is parameter estimation when
the interaction structure is known. By this we mean knowledge of the basic
neighborhood, the minimal lattice region that determines the conditional distribution at
a site on the condition that the values at all other sites are given; formal
definitions are in Section 2. The conditional probabilities involved, assumed translation
invariant, are parameters of the model. Note that they need not uniquely determine
the joint distribution on $\mathbb{Z}^d$, a phenomenon known as phase transition. Another
statistical problem is model selection, that is, the statistical estimation of the
interaction structure (the basic neighborhood). This paper is primarily devoted to the
latter.

Parameter estimation for Markov random fields with a known interaction struc- 
ture was considered by, among others, Pickard [19], Gidas [14, 15], Geman and 


Received December 2003, revised April 2005. 
1 Supported by Hungarian National Foundation for Scientific Research Grants T26041, T32323, 
TS40719 and T046376. 
2Supported by Hungarian National Foundation for Scientific Research Grant T046376. 
AMS 2000 subject classifications. Primary 60G60, 62F12; secondary 62M40, 82B20.
Key words and phrases. Markov random field, pseudo-likelihood, Gibbs measure, model selec- 
tion, information criterion, typicality. 




Graffigne [12] and Comets [6]. Typically, parameter estimation does not directly 
address the conditional probabilities mentioned above, but rather the potential. 
This admits parsimonious representation of the conditional probabilities that are 
not free parameters, but have to satisfy algebraic conditions that need not concern 
us here. For our purposes, however, potentials will not be needed. 

We are not aware of papers addressing model selection in the context of Markov 
random fields. In other contexts, penalized likelihood methods are popular; see 
[1, 21]. The Bayesian Information Criterion (BIC) of Schwarz [21] has been 
proven to lead to consistent estimation of the “order of the model” in various cases, 
such as i.i.d. processes with distributions from exponential families [17], autore- 
gressive processes [16] and Markov chains [11]. These proofs include the assump- 
tion that the number of candidate model classes is finite; for Markov chains this 
means that there is a known upper bound on the order of the process. The con- 
sistency of the BIC estimator of the order of a Markov chain without such prior 
bound was proved by Csiszár and Shields [8]; further related results appear in [7]. 
A related recent result, for processes with variable memory length [5, 22], is the 
consistency of the BIC estimator of the context tree, without any prior bound on 
memory depth [9]. 

For Markov random fields, penalized likelihood estimators like BIC run into 
the problem that the likelihood function cannot be calculated explicitly. In addi- 
tion, no simple formula is available for the "number of free parameters" typically 
used in the penalty term. To overcome these problems, we will replace likelihood 
by pseudo-likelihood, first introduced by Besag [4], and modify also the penalty 
term; this will lead us to an analogue of BIC called the Pseudo-Bayesian Informa- 
tion Criterion or PIC. Our main result is that if one minimizes this criterion for 
a family of hypothetical basic neighborhoods that grows with the sample size at 
a specified rate, the resulting PIC estimate of the basic neighborhood equals the 
true one eventually almost surely. In particular, the consistency theorem does not 
require a prior upper bound on the size of the basic neighborhood. It should be 
emphasized that the underlying Markov field need not be stationary (translation 
invariant), and phase transition causes no difficulty. 

An auxiliary result perhaps of independent interest is a typicality proposition 
on the uniform closeness of empirical conditional probabilities to the true ones, 
for conditioning regions whose size may grow with the sample size. Though this 
result is weaker than analogous ones for Markov chains in [7], it will be sufficient 
for our purposes. 

The structure of the paper is the following. In Section 2 we introduce the basic 
notation and definitions, and formulate the main result. Its proof is provided by the 
propositions in Sections 4 and 5. Section 3 contains the statement and proof of the 
typicality proposition. Section 4 excludes overestimation, that is, the possibility 
that the estimated basic neighborhood properly contains the true one, using the 
typicality proposition. Section 5 excludes underestimation, that is, the possibility 
that the estimated basic neighborhood does not contain the true one, via an entropy 




argument and a modification of the typicality result. Section 6 is a discussion of 
the results. The Appendix contains some technical lemmas. 


2. Notation and statement of the main results. We consider the $d$-dimensional
lattice $\mathbb{Z}^d$. The points $i \in \mathbb{Z}^d$ are called sites, and $\|i\|$ denotes the maximum
norm of $i$, that is, the maximum of the absolute values of the coordinates of $i$. The
cardinality of a finite set $\Delta$ is denoted by $|\Delta|$. The notations $\subseteq$ and $\subset$ of inclusion
and strict inclusion are distinguished in this paper.

A random field is a family of random variables indexed by the sites of the
lattice, $\{X(i) : i \in \mathbb{Z}^d\}$, where each $X(i)$ is a random variable with values in a finite
set $A$. For $\Delta \subseteq \mathbb{Z}^d$, a region of the lattice, we write $X(\Delta) = \{X(i) : i \in \Delta\}$. For the
realizations of $X(\Delta)$ we use the notation $a(\Delta) = \{a(i) \in A : i \in \Delta\}$. When $\Delta$ is
finite, the $|\Delta|$-tuples $a(\Delta) \in A^\Delta$ will be referred to as blocks.

The joint distribution of the random variables $X(i)$ is denoted by $Q$. We assume
that its finite-dimensional marginals are strictly positive, that is,

$$Q(a(\Delta)) = \mathrm{Prob}\{X(\Delta) = a(\Delta)\} > 0 \quad \text{for } \Delta \subset \mathbb{Z}^d \text{ finite}, \ a(\Delta) \in A^\Delta.$$

The last standard assumption admits unambiguous definition of the conditional
probabilities

$$Q(a(\Delta) \mid a(\Phi)) = \mathrm{Prob}\{X(\Delta) = a(\Delta) \mid X(\Phi) = a(\Phi)\}$$

for all disjoint finite regions $\Delta$ and $\Phi$.

By a neighborhood $\Gamma$ (of the origin 0) we mean a finite, central-symmetric set
of sites with $0 \notin \Gamma$. Its radius is $r(\Gamma) = \max_{i \in \Gamma} \|i\|$. For any $\Delta \subseteq \mathbb{Z}^d$, its translate
when 0 is translated to $i$ is denoted by $\Delta^i$. The translate $\Gamma^i$ of a neighborhood $\Gamma$
(of the origin) will be called the $\Gamma$-neighborhood of the site $i$; see Figure 1.











FIG. 1. The $\Gamma$-neighborhood of the site $i$, and the sample region $\Lambda_n$.


A Markov random field is a random field as above such that there exists a
neighborhood $\Gamma$, called a Markov neighborhood, satisfying for every $i \in \mathbb{Z}^d$

$$Q(a(i) \mid a(\Delta^i)) = Q(a(i) \mid a(\Gamma^i)) \qquad \text{if } \Delta \supseteq \Gamma,\ 0 \notin \Delta, \tag{2.1}$$

where the last conditional probability is translation invariant.
This concept is equivalent to that of a Gibbs field with a finite range interaction;
see [13]. Motivated by this fact, the matrix

$$Q_\Gamma = \{Q_\Gamma(a \mid a(\Gamma)) : a \in A,\ a(\Gamma) \in A^{\Gamma}\}$$

specifying the (positive, translation-invariant) conditional probabilities in (2.1)
will be called the one-point specification. All distributions on $A^{\mathbb{Z}^d}$ that satisfy (2.1)
with a given conditional probability matrix $Q_\Gamma$ are called Gibbs distributions with
one-point specification $Q_\Gamma$. The distribution $Q$ of the given Markov random field
is one of these; $Q$ is not necessarily translation invariant.

The following lemma summarizes some well-known facts; their formal derivation
from results in [13] is indicated in the Appendix.


LEMMA 2.1. For a Markov random field on the lattice as above, there exists
a neighborhood $\Gamma_0$ such that the Markov neighborhoods are exactly those that
contain $\Gamma_0$. Moreover, the global Markov property

$$Q(a(\Delta) \mid a(\mathbb{Z}^d \setminus \Delta)) = Q\Bigl(a(\Delta) \Bigm| a\Bigl(\bigcup_{i \in \Delta} \Gamma_0^i \setminus \Delta\Bigr)\Bigr)$$

holds for each finite region $\Delta \subset \mathbb{Z}^d$. These conditional probabilities are translation
invariant and uniquely determined by the one-point specification $Q_{\Gamma_0}$.


The smallest Markov neighborhood $\Gamma_0$ of Lemma 2.1 will be called the basic
neighborhood. The minimal element of the corresponding one-point specification
matrix $Q_{\Gamma_0}$ is denoted by $q_{\min}$:

$$q_{\min} = \min_{a \in A,\ a(\Gamma_0) \in A^{\Gamma_0}} Q_{\Gamma_0}(a \mid a(\Gamma_0)) > 0.$$

In this paper we are concerned with the statistical estimation of the basic
neighborhood $\Gamma_0$ from observation of a realization of the Markov random field on an
increasing sequence of finite regions $\Lambda_n \subset \mathbb{Z}^d$, $n \in \mathbb{N}$; thus the $n$th sample is $x(\Lambda_n)$.

We will draw the statistical inference about a possible basic neighborhood $\Gamma$
based on the blocks $a(\Gamma) \in A^{\Gamma}$ appearing in the sample $x(\Lambda_n)$. For technical
reasons, we will consider only such blocks whose center is in a subregion $\bar\Lambda_n$
of $\Lambda_n$, consisting of those sites $i \in \Lambda_n$ for which the ball with center $i$ and radius
$\log^{1/(2d)} |\Lambda_n|$ also belongs to $\Lambda_n$:

$$\bar\Lambda_n = \bigl\{ i \in \Lambda_n : \{ j \in \mathbb{Z}^d : \|i - j\| \le \log^{1/(2d)} |\Lambda_n| \} \subseteq \Lambda_n \bigr\};$$

see Figure 1. Our only assumptions about the sample regions $\Lambda_n$ will be that

$$\Lambda_1 \subset \Lambda_2 \subset \cdots; \qquad |\bar\Lambda_n| / |\Lambda_n| \to 1.$$
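The construction of the inner region can be made concrete with a minimal sketch. It assumes $d = 2$ and a rectangular sample region with sites `(0, 0)` to `(height - 1, width - 1)`; the function name `inner_window` and this rectangular shape are illustrative assumptions, not part of the paper.

```python
import math

def inner_window(height, width):
    """Sites of a height x width rectangle (playing the role of Lambda_n, d = 2)
    whose max-norm ball of radius log^{1/(2d)} |Lambda_n| stays inside it."""
    d = 2
    size = height * width                       # |Lambda_n|
    radius = math.log(size) ** (1.0 / (2 * d))  # log^{1/(2d)} |Lambda_n|
    # Lattice sites within max-norm distance <= radius form a box of
    # half-width floor(radius), so a site qualifies iff it is at least
    # that far from every edge of the rectangle.
    margin = math.floor(radius)
    return [(i, j)
            for i in range(margin, height - margin)
            for j in range(margin, width - margin)]

window = inner_window(10, 10)
ratio = len(window) / (10 * 10)   # the assumption above asks this ratio to tend to 1
```

For a 10 x 10 region the radius is below 2, so only the outermost layer of sites is dropped; the ratio approaches 1 as the rectangle grows because the margin grows only like a power of the logarithm of the sample size.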


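For each block $a(\Gamma) \in A^{\Gamma}$, let $N_n(a(\Gamma))$ denote the number of occurrences of
the block $a(\Gamma)$ in the sample $x(\Lambda_n)$ with the center in $\bar\Lambda_n$,

$$N_n(a(\Gamma)) = \bigl|\{ i \in \bar\Lambda_n : \Gamma^i \subseteq \Lambda_n,\ x(\Gamma^i) = a(\Gamma) \}\bigr|.$$

The blocks corresponding to $\Gamma$-neighborhoods completed with their centers will
be denoted briefly by $a(\Gamma, 0)$. Similarly as above, for each $a(\Gamma, 0) \in A^{\Gamma \cup \{0\}}$ we
write

$$N_n(a(\Gamma, 0)) = \bigl|\{ i \in \bar\Lambda_n : \Gamma^i \subseteq \Lambda_n,\ x(\Gamma^i \cup \{i\}) = a(\Gamma, 0) \}\bigr|.$$

The notation $a(\Gamma, 0) \in x(\Lambda_n)$ will mean that $N_n(a(\Gamma, 0)) \ge 1$.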
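As an illustration of the two counts, here is a small sketch for $d = 2$. The sample is assumed to be stored as a dictionary from sites to symbols, and a neighborhood as a list of offsets; the name `block_counts` and this data layout are hypothetical choices made for the example.

```python
from collections import Counter

def block_counts(x, gamma, window):
    """Count blocks a(Gamma) and a(Gamma, 0) with center in `window`.

    x      : dict mapping sites (i, j) of the sample region to symbols
    gamma  : iterable of offsets (di, dj), a neighborhood of the origin
    window : iterable of admissible centers (the inner region)
    """
    gamma = sorted(gamma)                    # fix an order of the offsets
    n_ctx, n_full = Counter(), Counter()
    for (i, j) in window:
        sites = [(i + di, j + dj) for (di, dj) in gamma]
        if any(s not in x for s in sites):   # require Gamma^i inside the sample region
            continue
        ctx = tuple(x[s] for s in sites)     # the block a(Gamma) seen around i
        n_ctx[ctx] += 1
        n_full[(ctx, x[(i, j)])] += 1        # the block a(Gamma, 0)
    return n_ctx, n_full

# Toy sample: a 3 x 3 checkerboard, centers in the middle column,
# neighborhood consisting of the left and right neighbors.
x = {(i, j): (i + j) % 2 for i in range(3) for j in range(3)}
n_ctx, n_full = block_counts(x, [(0, -1), (0, 1)], [(0, 1), (1, 1), (2, 1)])
```

Keys of `n_full` pair the neighborhood block with the symbol at the center, so summing `n_full` over the center symbol recovers `n_ctx`.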

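The restriction $\Gamma^i \subseteq \Lambda_n$ in the above definitions is automatically satisfied if
$r(\Gamma) \le \log^{1/(2d)} |\Lambda_n|$. Hence the same number of blocks is taken into account for
all neighborhoods, except for very large ones:

$$\sum_{a(\Gamma) \in A^{\Gamma}} N_n(a(\Gamma)) = |\bar\Lambda_n| \qquad \text{if } r(\Gamma) \le \log^{1/(2d)} |\Lambda_n|.$$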
a(l)eAU 


For Markov random fields the likelihood function cannot be explicitly
determined. We shall use instead the pseudo-likelihood defined below.

Given the sample $x(\Lambda_n)$, the pseudo-likelihood function associated with a
neighborhood $\Gamma$ is the following function of a matrix $Q'_\Gamma$ regarded as the
one-point specification of a hypothetical Markov random field for which $\Gamma$ is a Markov
neighborhood:

$$\mathrm{PL}_\Gamma(x(\Lambda_n), Q'_\Gamma) = \prod_{i \in \bar\Lambda_n} Q'_\Gamma(x(i) \mid x(\Gamma^i)) = \prod_{a(\Gamma,0) \in x(\Lambda_n)} Q'_\Gamma(a(0) \mid a(\Gamma))^{N_n(a(\Gamma,0))}. \tag{2.2}$$

We note that not all matrices $Q'_\Gamma$ satisfying

$$\sum_{a(0) \in A} Q'_\Gamma(a(0) \mid a(\Gamma)) = 1, \qquad a(\Gamma) \in A^{\Gamma},$$

are possible one-point specifications; the elements of a one-point specification
matrix have to satisfy several algebraic relations not shown here. Still, we define the
pseudo-likelihood also for $Q'_\Gamma$ not satisfying those relations, even admitting some
elements of $Q'_\Gamma$ to be 0.

The maximum of this pseudo-likelihood is attained for
$Q'_\Gamma(a(0) \mid a(\Gamma)) = N_n(a(\Gamma,0)) / N_n(a(\Gamma))$. Thus, given the sample $x(\Lambda_n)$, the logarithm of the maximum
pseudo-likelihood for the neighborhood $\Gamma$ is

$$\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) = \sum_{a(\Gamma,0) \in x(\Lambda_n)} N_n(a(\Gamma,0)) \log \frac{N_n(a(\Gamma,0))}{N_n(a(\Gamma))}. \tag{2.3}$$
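Formula (2.3) depends on the sample only through the block counts, which the following sketch makes explicit. It assumes $d = 2$, a sample stored as a dictionary from sites to symbols, and natural logarithms; `log_mpl` is a hypothetical helper name, not notation from the paper.

```python
import math
from collections import Counter

def log_mpl(x, gamma, window):
    """Logarithm of the maximum pseudo-likelihood, following (2.3)."""
    gamma = sorted(gamma)
    n_ctx, n_full = Counter(), Counter()
    for (i, j) in window:
        sites = [(i + di, j + dj) for (di, dj) in gamma]
        if any(s not in x for s in sites):   # Gamma^i must lie inside the sample region
            continue
        ctx = tuple(x[s] for s in sites)
        n_ctx[ctx] += 1
        n_full[(ctx, x[(i, j)])] += 1
    # Only blocks that actually occur contribute, so each logarithm is finite.
    return sum(n * math.log(n / n_ctx[ctx]) for (ctx, _a), n in n_full.items())

x = {(i, j): (i + j) % 2 for i in range(3) for j in range(3)}
centers = [(0, 1), (1, 1), (2, 1)]
with_neighbors = log_mpl(x, [(0, -1), (0, 1)], centers)   # center determined by its context
without_neighbors = log_mpl(x, [], centers)               # empty neighborhood
```

On this checkerboard toy sample the horizontal neighbors determine the center symbol, so the first value is 0, while the empty neighborhood gives the log-likelihood of the empirical symbol frequencies.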


Now we are able to formalize a criterion in analogy to the Bayesian Information 
Criterion that can be calculated from the sample. 


DEFINITION 2.1. Given a sample $x(\Lambda_n)$, the Pseudo-Bayesian Information
Criterion, in short PIC, for the neighborhood $\Gamma$ is

$$\mathrm{PIC}_\Gamma(x(\Lambda_n)) = -\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) + |A|^{|\Gamma|} \log |\Lambda_n|.$$


REMARK. In our penalty term, the number $|A|^{|\Gamma|}$ of possible blocks
$a(\Gamma) \in A^{\Gamma}$ replaces "half the number of free parameters" appearing in BIC, for
which number no simple formula is available. Note that our results remain valid,
with the same proofs, if the above penalty term is multiplied by any $c > 0$.
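To give a feeling for the size of the penalty, the following sketch evaluates it for given cardinalities. The function names and the optional multiplier `c` (mirroring the constant mentioned in the Remark) are illustrative assumptions.

```python
import math

def pic_penalty(alphabet_size, gamma_size, sample_size, c=1.0):
    """Penalty term c * |A|^{|Gamma|} * log |Lambda_n| of Definition 2.1."""
    return c * alphabet_size ** gamma_size * math.log(sample_size)

def pic_value(neg_log_mpl, alphabet_size, gamma_size, sample_size):
    """PIC = -log MPL + penalty, given the value of -log MPL."""
    return neg_log_mpl + pic_penalty(alphabet_size, gamma_size, sample_size)

# Binary alphabet on a 100 x 100 region: the empty neighborhood versus the
# full radius-1 neighborhood in dimension 2 (8 sites).
small = pic_penalty(2, 0, 100 * 100)
large = pic_penalty(2, 8, 100 * 100)
```

The penalty grows exponentially in the number of sites of the neighborhood, which is why only slowly growing radii are admitted in the main result.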


The PIC estimator of the basic neighborhood $\Gamma_0$ is defined as that hypothetical
$\Gamma$ for which the value of the criterion is minimal. An important feature of our
estimator is that the family of hypothetical $\Gamma$'s is allowed to extend as $n \to \infty$,
and thus no a priori upper bound for the size of the unknown $\Gamma_0$ is needed. Our
main result says the PIC estimator is strongly consistent if the hypothetical $\Gamma$'s are
those with $r(\Gamma) \le r_n$, where $r_n$ grows sufficiently slowly.

We mean by strong consistency that the estimated basic neighborhood equals $\Gamma_0$
eventually almost surely as $n \to \infty$. Here and in the sequel, "eventually almost
surely" means that with probability 1 there exists a threshold $n_0$ [depending on the
infinite realization $x(\mathbb{Z}^d)$] such that the claim holds for all $n \ge n_0$.


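THEOREM 2.1. The PIC estimator

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) = \arg\min_{\Gamma : r(\Gamma) \le r_n} \mathrm{PIC}_\Gamma(x(\Lambda_n)),$$

with

$$r_n = o\bigl(\log^{1/(2d)} |\bar\Lambda_n|\bigr),$$

satisfies

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) = \Gamma_0$$

eventually almost surely as $n \to \infty$.


PROOF. Theorem 2.1 follows from Propositions 4.1 and 5.1 below. □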
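A brute-force version of the estimator can be sketched as follows, under several assumptions that are not from the paper: $d = 2$, a sample stored as a dictionary from sites to symbols, a window whose margin is at least the search radius `r` (so every translate of a candidate stays inside the sample), and exhaustive enumeration of all candidates. The function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def candidate_neighborhoods(r):
    """All central-symmetric subsets of the max-norm ball of radius r in Z^2, origin excluded."""
    half = [(i, j) for i in range(-r, r + 1) for j in range(-r, r + 1) if (i, j) > (0, 0)]
    for m in range(len(half) + 1):
        for chosen in combinations(half, m):
            yield tuple(sorted(set(chosen) | {(-i, -j) for (i, j) in chosen}))

def pic(x, gamma, window, alphabet_size):
    """PIC of Definition 2.1; assumes every translate of gamma around a window site lies in x."""
    n_ctx, n_full = Counter(), Counter()
    for (i, j) in window:
        ctx = tuple(x[(i + di, j + dj)] for (di, dj) in gamma)
        n_ctx[ctx] += 1
        n_full[(ctx, x[(i, j)])] += 1
    log_mpl = sum(n * math.log(n / n_ctx[c]) for (c, _a), n in n_full.items())
    return -log_mpl + alphabet_size ** len(gamma) * math.log(len(x))

def pic_estimate(x, window, alphabet_size, r):
    """Minimize PIC over all candidates of radius <= r (ties go to the first candidate)."""
    return min(candidate_neighborhoods(r), key=lambda g: pic(x, g, window, alphabet_size))

# Toy sample: vertical stripes on a 12 x 12 region, window with margin 1.
x = {(i, j): j % 2 for i in range(12) for j in range(12)}
window = [(i, j) for i in range(1, 11) for j in range(1, 11)]
best = pic_estimate(x, window, alphabet_size=2, r=1)
```

The striped sample is deterministic and therefore outside the strict positivity assumption of the theorem; it only exercises the code. The enumeration visits $2^{((2r+1)^2 - 1)/2}$ candidates, which is feasible only for very small radii.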


REMARK. Actually, the assertion will be proved for $r_n$ equal to a constant
times $\log^{1/(2d)} |\bar\Lambda_n|$. However, as this constant depends on the unknown
distribution $Q$, the consistency can be guaranteed only when

$$r_n = o\bigl(\log^{1/(2d)} |\bar\Lambda_n|\bigr) = o\bigl(\log^{1/(2d)} |\Lambda_n|\bigr).$$

It remains open whether consistency holds when the hypothetical neighborhoods
are allowed to grow faster, or even without any condition on the hypothetical
neighborhoods.


As a consequence of the above, we are able to construct a strongly consistent
estimator of the one-point specification $Q_{\Gamma_0}$.


COROLLARY 2.1. The empirical estimator of the one-point specification,

$$\hat Q_{\hat\Gamma}(a(0) \mid a(\hat\Gamma)) = \frac{N_n(a(\hat\Gamma, 0))}{N_n(a(\hat\Gamma))}, \qquad a(0) \in A,\ a(\hat\Gamma) \in A^{\hat\Gamma},$$

converges to the true $Q_{\Gamma_0}$ almost surely as $n \to \infty$, where $\hat\Gamma$ is the PIC
estimator $\hat\Gamma_{\mathrm{PIC}}$.


PROOF. Immediate from Theorem 2.1 and Proposition 3.1 below. □
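The empirical estimator is a ratio of block counts, as the following sketch shows. It assumes the counts are already available as dictionaries keyed by the neighborhood block and by the pair (neighborhood block, center symbol); this layout and the name `empirical_specification` are illustrative assumptions.

```python
def empirical_specification(n_ctx, n_full):
    """Empirical one-point specification N_n(a(Gamma, 0)) / N_n(a(Gamma)).

    n_ctx  : dict mapping a neighborhood block to its count
    n_full : dict mapping (neighborhood block, center symbol) to its count
    """
    return {(ctx, a): n / n_ctx[ctx] for (ctx, a), n in n_full.items()}

# Hypothetical counts for a binary alphabet and a two-site neighborhood.
n_ctx = {(0, 0): 4, (1, 1): 2}
n_full = {((0, 0), 1): 3, ((0, 0), 0): 1, ((1, 1), 0): 2}
q_hat = empirical_specification(n_ctx, n_full)
```

Blocks that never occur in the sample get no entry, so the estimate is defined only on the observed neighborhood blocks; each observed row sums to 1.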
3. The typicality result. 


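PROPOSITION 3.1. Simultaneously for all Markov neighborhoods $\Gamma$ with
$r(\Gamma) \le \alpha^{1/(2d)} \log^{1/(2d)} |\bar\Lambda_n|$ and blocks $a(\Gamma, 0) \in A^{\Gamma \cup \{0\}}$,

$$\left| \frac{N_n(a(\Gamma, 0))}{N_n(a(\Gamma))} - Q(a(0) \mid a(\Gamma)) \right| < \sqrt{\frac{\kappa \log N_n(a(\Gamma))}{N_n(a(\Gamma))}}$$

eventually almost surely as $n \to \infty$, if

$$0 < \alpha \le 1, \qquad \kappa > 2^{3d} e \alpha \log(|A|^2 + 1).$$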


To prove this proposition we will use an idea similar to the "coding technique"
of Besag [3]; namely, we partition $\bar\Lambda_n$ into subsets $\Lambda_n^k$ such that the random
variables at the sites $i \in \Lambda_n^k$ are conditionally independent given the values of those at
the other sites. First we introduce some further notation. Let

$$R_n = \bigl\lfloor \alpha^{1/(2d)} \log^{1/(2d)} |\bar\Lambda_n| \bigr\rfloor. \tag{3.4}$$

We partition the region $\bar\Lambda_n$ by intersecting it with sublattices of $\mathbb{Z}^d$ such that the
distance between sites in a sublattice is $4R_n + 1$. The intersections of $\bar\Lambda_n$ with these
sublattices will be called sieves. Indexed by the offset $k$ relative to the origin 0, the
sieves are

$$\Lambda_n^k = \{ i \in \bar\Lambda_n : i = k + (4R_n + 1)v,\ v \in \mathbb{Z}^d \}, \qquad \|k\| \le 2R_n;$$

see Figure 2.

FIG. 2. The sieve $\Lambda_n^k$.

For a neighborhood $\Gamma$, let $N_n^k(a(\Gamma))$ denote the number of occurrences
of the block $a(\Gamma) \in A^{\Gamma}$ in the sample $x(\Lambda_n)$ with center in $\Lambda_n^k$,

$$N_n^k(a(\Gamma)) = \bigl|\{ i \in \Lambda_n^k : \Gamma^i \subseteq \Lambda_n,\ x(\Gamma^i) = a(\Gamma) \}\bigr|.$$

Similarly, let

$$N_n^k(a(\Gamma, 0)) = \bigl|\{ i \in \Lambda_n^k : \Gamma^i \subseteq \Lambda_n,\ x(\Gamma^i \cup \{i\}) = a(\Gamma, 0) \}\bigr|.$$

Clearly,

$$N_n(a(\Gamma)) = \sum_{k : \|k\| \le 2R_n} N_n^k(a(\Gamma)) \qquad \text{and} \qquad N_n(a(\Gamma, 0)) = \sum_{k : \|k\| \le 2R_n} N_n^k(a(\Gamma, 0)).$$

The notation $a(\Gamma) \in x(\Lambda_n^k)$ will mean that $N_n^k(a(\Gamma)) \ge 1$.
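The partition into sieves can be sketched in a few lines. The sketch assumes sites are integer tuples and takes the radius `R` as a given parameter rather than computing it from (3.4); the function name `sieves` is a hypothetical choice.

```python
def sieves(window, R):
    """Partition `window` into sieves: classes of sites congruent modulo 4R + 1,
    each labeled by its offset k with max-norm at most 2R."""
    step = 4 * R + 1
    parts = {}
    for site in window:
        # Representative of each coordinate's residue class in {-2R, ..., 2R}.
        k = tuple((c + 2 * R) % step - 2 * R for c in site)
        parts.setdefault(k, []).append(site)
    return parts

window = [(i, j) for i in range(10) for j in range(10)]
parts = sieves(window, R=1)
```

Within one sieve, distinct sites differ by a nonzero multiple of $4R + 1$ in some coordinate, so the max-norm balls of radius $2R$ around them are disjoint; this separation is what the conditional independence argument relies on. Since the sieves partition the window, counts taken per sieve add up to the counts over the whole window.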
Denote by $\Phi_n(\Gamma)$ the set of sites outside the neighborhood $\Gamma$ whose norm is at
most $2R_n$,

$$\Phi_n(\Gamma) = \{ i \in \mathbb{Z}^d : \|i\| \le 2R_n,\ i \notin \Gamma \};$$

see Figure 2. $\Phi_n^i(\Gamma)$ denotes the translate of $\Phi_n(\Gamma)$ when 0 is translated to $i$.

For a finite region $\Xi \subset \mathbb{Z}^d$, conditional probabilities on the condition $X(\Xi) =
x(\Xi) \in A^{\Xi}$ will be denoted briefly by $\mathrm{Prob}\{\,\cdot \mid x(\Xi)\}$.

In the following lemma the neighborhoods $\Gamma$ need not be Markov neighborhoods.


LEMMA 3.1. Simultaneously for all sieves $k$, neighborhoods $\Gamma$ with
$r(\Gamma) \le R_n$ and blocks $a(\Gamma) \in A^{\Gamma}$,

$$(1 + \varepsilon) \log N_n^k(a(\Gamma)) \ge \log |\bar\Lambda_n|,$$

eventually almost surely as $n \to \infty$, where $\varepsilon > 0$ is an arbitrary constant.


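PROOF. As a consequence of Lemma 2.1, for any fixed sieve $k$ and neighborhood
$\Gamma$ with $r(\Gamma) \le R_n$, the random variables $X(\Gamma^i)$, $i \in \Lambda_n^k$, are conditionally
independent given the values of the random variables in the rest of the sites of the
sample region $\Lambda_n$. By Lemma A.5 in the Appendix,

$$Q(a(\Gamma) \mid a(\Phi_n(\Gamma))) \ge q_{\min}^{|\Gamma|}, \qquad a(\Phi_n(\Gamma)) \in A^{\Phi_n(\Gamma)},$$

hence we can use the large deviation theorem of Lemma A.3 in the Appendix with
$p_* = q_{\min}^{|\Gamma|}$ to obtain

$$\mathrm{Prob}\Bigl\{ \frac{N_n^k(a(\Gamma))}{|\Lambda_n^k|} < \frac12\, q_{\min}^{|\Gamma|} \Bigm| x\Bigl(\bigcup_{i \in \Lambda_n^k} \Phi_n^i(\Gamma)\Bigr) \Bigr\} \le \exp\Bigl[ -|\Lambda_n^k|\, \frac{q_{\min}^{|\Gamma|}}{16} \Bigr].$$

Hence also for the unconditional probabilities,

$$\mathrm{Prob}\Bigl\{ \frac{N_n^k(a(\Gamma))}{|\Lambda_n^k|} < \frac12\, q_{\min}^{|\Gamma|} \Bigr\} \le \exp\Bigl[ -|\Lambda_n^k|\, \frac{q_{\min}^{|\Gamma|}}{16} \Bigr].$$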
Note that for n > no (not depending on k) we have 
Atja 1 dől Än 


72(4R,4 104 ^ (SR 


Using this and the consequence $|\Gamma| \le (2R_n + 1)^d \le (3R_n)^d$ of $r(\Gamma) \le R_n$, the last
probability bound implies for $n \ge n_0$





d d 
N*(a(T)) q GR) " qR» 
Pr b LIE II eae nun < |- A uin |. 
o| \An| < fn | sen BETTE 


Using the union bound and Lemma A.6 in the Appendix, it follows that 


d 
Neal) | dmm 
lA — 2 Rn)” 





Prob] 


for some k, T, a(r) with |k|| x 2R4,,r(T) < R4,a(D)e ar] 


(3Rn)4 
- | Qmm d 2 (2R44-1)4 /2 
« —| Aa Ee. — (4R +D- (IA 1“™ ; 
< exp] | alae | (4R, -- 1 - (AP 4- 1) 
Recalling (3.4), this is summable in n, and thus the Borel-Cantelli lemma gives 


3491/7 (1-Hlog | A41)? 
AF (dry = Agi sies 
RARE 13 Sto + log |A,|)!/4 
eventually almost surely as $n \to \infty$, simultaneously for all sieves $k$, neighborhoods
$\Gamma$ with $r(\Gamma) \le R_n$ and blocks $a(\Gamma) \in A^{\Gamma}$. This proves the lemma. □


LEMMA 3.2. Simultaneously for all sieves $k$, Markov neighborhoods $\Gamma$ with
$r(\Gamma) \le R_n$ and blocks $a(\Gamma, 0) \in A^{\Gamma \cup \{0\}}$,

$$\left| \frac{N_n^k(a(\Gamma, 0))}{N_n^k(a(\Gamma))} - Q(a(0) \mid a(\Gamma)) \right| < \sqrt{\frac{\delta \log^{1/2} N_n^k(a(\Gamma))}{N_n^k(a(\Gamma))}},$$

eventually almost surely as $n \to \infty$, if

$$\delta > 2^d e \alpha^{1/2} \log(|A|^2 + 1).$$


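PROOF. Given a sieve $k$, a Markov neighborhood $\Gamma$ and a block $a(\Gamma, 0)$, the
difference $N_n^k(a(\Gamma, 0)) - N_n^k(a(\Gamma))\, Q(a(0) \mid a(\Gamma))$ equals

$$Y_n = \sum_{i \in \Lambda_n^k :\ x(\Gamma^i) = a(\Gamma)} \bigl[ I(X(i) = a(0)) - Q(a(0) \mid a(\Gamma)) \bigr],$$

where $I(\cdot)$ denotes the indicator function; hence the claimed inequality is equivalent
to

$$-\sqrt{N_n^k(a(\Gamma))\, \delta \log^{1/2} N_n^k(a(\Gamma))} < Y_n < \sqrt{N_n^k(a(\Gamma))\, \delta \log^{1/2} N_n^k(a(\Gamma))}.$$

We will prove that the last inequalities hold eventually almost surely as $n \to \infty$,
simultaneously for all sieves $k$, Markov neighborhoods $\Gamma$ with $r(\Gamma) \le R_n$ and
blocks $a(\Gamma, 0) \in A^{\Gamma \cup \{0\}}$. We concentrate on the second inequality; the proof for
the first one is similar.

Denote

$$G_j(k, a(\Gamma, 0)) = \Bigl\{ \max_{n \in \mathcal{N}_j(k, a(\Gamma))} Y_n \ge \sqrt{e^j \delta j^{1/2}} \Bigr\},$$

where

$$\mathcal{N}_j(k, a(\Gamma)) = \bigl\{ n : e^j < N_n^k(a(\Gamma)) \le e^{j+1},\ (1 + \varepsilon) \log N_n^k(a(\Gamma)) \ge \log |\bar\Lambda_n| \bigr\};$$

if $n \in \mathcal{N}_j(k, a(\Gamma))$, then by (3.4)

$$R_n = \bigl\lfloor \alpha^{1/(2d)} \log^{1/(2d)} |\bar\Lambda_n| \bigr\rfloor \le \alpha^{1/(2d)} \bigl((1 + \varepsilon)(j + 1)\bigr)^{1/(2d)} \stackrel{\mathrm{def}}{=} R^{(j)}. \tag{3.5}$$

The claimed inequality $Y_n < \sqrt{N_n^k(a(\Gamma))\, \delta \log^{1/2} N_n^k(a(\Gamma))}$ holds for each $n$
with $e^j < N_n^k(a(\Gamma)) \le e^{j+1}$ if

$$\max_{n :\ e^j < N_n^k(a(\Gamma)) \le e^{j+1}} Y_n < \sqrt{e^j \delta j^{1/2}}.$$

By Lemma 3.1, the condition $(1 + \varepsilon) \log N_n^k(a(\Gamma)) \ge \log |\bar\Lambda_n|$ in the definition of
$\mathcal{N}_j(k, a(\Gamma))$ is satisfied eventually almost surely, simultaneously for all sieves $k$,
neighborhoods $\Gamma$ with $r(\Gamma) \le R_n$ and blocks $a(\Gamma) \in A^{\Gamma}$. Hence it suffices to prove
that the following holds with probability 1: the union of the events $G_j(k, a(\Gamma, 0))$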

for all $k$ with $\|k\| \le 2R^{(j)}$, all $\Gamma \supseteq \Gamma_0$ with $r(\Gamma) \le R^{(j)}$ and all $a(\Gamma, 0) \in A^{\Gamma \cup \{0\}}$,
obtains only for finitely many $j$.
As $n \in \mathcal{N}_j(k, a(\Gamma))$ implies $j < \log |\bar\Lambda_n| \le (1 + \varepsilon)(j + 1)$,

$$G_j(k, a(\Gamma, 0)) \subseteq \bigcup_{l = j}^{\lfloor (1+\varepsilon)(j+1) \rfloor} \Bigl\{ \max_{n \in \mathcal{N}_{j,l}(k, a(\Gamma))} Y_n \ge \sqrt{e^j \delta j^{1/2}} \Bigr\}, \tag{3.6}$$

where

$$\mathcal{N}_{j,l}(k, a(\Gamma)) = \bigl\{ n : e^j < N_n^k(a(\Gamma)) \le e^{j+1},\ l < \log |\bar\Lambda_n| \le l + 1 \bigr\}.$$

The random variables $X(i)$, $i \in \Lambda_n^k$, are conditionally independent given the
values of the random variables in their $\Gamma$-neighborhoods. Moreover, those $X(i)$'s for
which the same block $a(\Gamma)$ appears in their $\Gamma$-neighborhood are also conditionally
i.i.d. Hence $Y_n$ is the sum of $N_n^k(a(\Gamma))$ conditionally i.i.d. random variables with
mean 0 and variance

$$\tfrac14 \ge D^2 = Q(a(0) \mid a(\Gamma))\bigl[1 - Q(a(0) \mid a(\Gamma))\bigr] \ge \tfrac12\, q_{\min}.$$

As $R_n$ is constant for $n$ with $l < \log |\bar\Lambda_n| \le l + 1$, the corresponding $Y_n$'s are
actually partial sums of a sequence of $N_{n^*}^k(a(\Gamma)) \le e^{j+1}$ such conditionally i.i.d.
random variables, where $n^*$ is the largest element of $\mathcal{N}_{j,l}(k, a(\Gamma))$. Therefore, using
Lemma A.4 in the Appendix with $\mu = \mu_j = (1 - \eta)\sqrt{e^{-1} \delta j^{1/2}}$, where $\eta > 0$ is an
arbitrary constant, we have


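$$\mathrm{Prob}\Bigl\{ \max_{n \in \mathcal{N}_{j,l}(k, a(\Gamma))} Y_n \ge \sqrt{e^j \delta j^{1/2}} \Bigm| x\Bigl(\bigcup_{i \in \Lambda_{n^*}^k :\ x(\Gamma^i) = a(\Gamma)} \Gamma^i\Bigr) \Bigr\}$$

$$\le \mathrm{Prob}\Bigl\{ \max_{n \in \mathcal{N}_{j,l}(k, a(\Gamma))} Y_n \ge D\sqrt{e^{j+1}}\,\bigl((1 - \eta)\sqrt{e^{-1} \delta j^{1/2}} + 2\bigr) \Bigm| x\Bigl(\bigcup_{i \in \Lambda_{n^*}^k :\ x(\Gamma^i) = a(\Gamma)} \Gamma^i\Bigr) \Bigr\}$$

$$\le \frac83 \exp\Bigl[ -\frac{\mu_j^2}{2\bigl(1 + \mu_j / (2D\sqrt{e^{j+1}})\bigr)^2} \Bigr].$$

On account of $\lim_{j \to \infty} \mu_j / (2D\sqrt{e^{j+1}}) = 0$, the last bound can be continued for
$j \ge j_0$, as

$$\le \frac83 \exp\Bigl[ -\frac{(1 - \eta)^2}{2e(1 + \eta)}\, \delta j^{1/2} \Bigr].$$

This bound also holds for the unconditional probabilities, hence we obtain
from (3.6),

$$\mathrm{Prob}\{G_j(k, a(\Gamma, 0))\} \le (\varepsilon j + 2) \cdot \frac83 \exp\Bigl[ -\frac{(1 - \eta)^2}{2e(1 + \eta)}\, \delta j^{1/2} \Bigr] \le \exp\Bigl[ -\frac{(1 - \eta)^3}{2e(1 + \eta)}\, \delta j^{1/2} \Bigr].$$

To bound the number of all admissible $k$, $\Gamma$, $a(\Gamma, 0)$ [recall the conditions $\|k\| \le
2R^{(j)}$, $r(\Gamma) \le R^{(j)}$, with $R^{(j)}$ defined in (3.5)], note that the number of possible
$k$'s is bounded by

$$(4R^{(j)} + 1)^d \le (4 + \rho)^d \alpha^{1/2} (1 + \varepsilon)^{1/2} (j + 1)^{1/2}$$

and, by Lemma A.6 in the Appendix, the number of possible blocks $a(\Gamma, 0)$ with
$r(\Gamma) \le R^{(j)}$ is bounded by

$$(|A|^2 + 1)^{(2R^{(j)} + 1)^d / 2} \le (|A|^2 + 1)^{(2 + \rho)^d \alpha^{1/2} (1 + \varepsilon)^{1/2} (j + 1)^{1/2} / 2}.$$

Combining the above bounds, we get for the probability of the union of the
events $G_j(k, a(\Gamma, 0))$ for all admissible $k$, $\Gamma$, $a(\Gamma, 0)$ the bound

$$\exp\Bigl[ -\frac{(1 - \eta)^3}{2e(1 + \eta)}\, \delta j^{1/2} + \log(|A|^2 + 1)\, (2 + \rho)^d\, 2^{-1} \alpha^{1/2} (1 + \varepsilon)^{1/2} (j + 1)^{1/2} + O(\log j) \Bigr].$$

This is summable in $j$ if we choose $\eta$, $\varepsilon$, $\rho$ sufficiently small, and $\delta/(2e) >
2^{d-1} \alpha^{1/2} \log(|A|^2 + 1)$, that is, if $\delta > 2^d e \alpha^{1/2} \log(|A|^2 + 1)$. □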


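PROOF OF PROPOSITION 3.1. Using Lemma 3.2,

$$\left| \frac{N_n(a(\Gamma, 0))}{N_n(a(\Gamma))} - Q(a(0) \mid a(\Gamma)) \right| \le \sum_{k : \|k\| \le 2R_n} \left| \frac{N_n^k(a(\Gamma, 0))}{N_n^k(a(\Gamma))} - Q(a(0) \mid a(\Gamma)) \right| \frac{N_n^k(a(\Gamma))}{N_n(a(\Gamma))}$$

$$< \sum_{k : \|k\| \le 2R_n} \sqrt{\frac{\delta \log^{1/2} N_n^k(a(\Gamma))}{N_n^k(a(\Gamma))}}\ \frac{N_n^k(a(\Gamma))}{N_n(a(\Gamma))}$$

eventually almost surely as $n \to \infty$. By Jensen's inequality and $N_n^k(a(\Gamma)) \le
N_n(a(\Gamma))$, this can be continued as

$$\le \sqrt{\frac{\delta (4R_n + 1)^d \log^{1/2} N_n(a(\Gamma))}{N_n(a(\Gamma))}}.$$

By (3.4) and Lemma 3.1, we have for any $\varepsilon, \rho > 0$ and $n$ sufficiently large,

$$(4R_n + 1)^d \le \bigl(4 \alpha^{1/(2d)} (\log |\bar\Lambda_n|)^{1/(2d)} + 1\bigr)^d \le (4 + \rho)^d \alpha^{1/2} (1 + \varepsilon)^{1/2} \log^{1/2} N_n(a(\Gamma)),$$

eventually almost surely as $n \to \infty$. Since $\kappa > 2^{3d} e \alpha \log(|A|^2 + 1)$, the constants
$\delta$, $\varepsilon$ and $\rho$ can be chosen so that $\delta$ satisfies the condition of Lemma 3.2 and
$\delta (4 + \rho)^d \alpha^{1/2} (1 + \varepsilon)^{1/2} \le \kappa$. This completes the proof. □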


4. The overestimation. 


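PROPOSITION 4.1. Eventually almost surely as $n \to \infty$,

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) \notin \{\Gamma : \Gamma \supset \Gamma_0\},$$

whenever $r_n$ in Theorem 2.1 is equal to $R_n$ in (3.4) with

$$\alpha < \frac{q_{\min}\,(|A| - 1)}{2^{3d} e\, |A|^2 \log(|A|^2 + 1)}.$$

PROOF. We have to prove that simultaneously for all neighborhoods $\Gamma \supset \Gamma_0$
with $r(\Gamma) \le R_n$,

$$\mathrm{PIC}_\Gamma(x(\Lambda_n)) - \mathrm{PIC}_{\Gamma_0}(x(\Lambda_n)) > 0, \tag{4.7}$$

eventually almost surely as $n \to \infty$.
The left-hand side

$$-\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) + |A|^{|\Gamma|} \log |\Lambda_n| + \log \mathrm{MPL}_{\Gamma_0}(x(\Lambda_n)) - |A|^{|\Gamma_0|} \log |\Lambda_n|$$

is bounded below by

$$-\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) + \log \mathrm{PL}_{\Gamma_0}(x(\Lambda_n), Q_{\Gamma_0}) + \Bigl(1 - \frac{1}{|A|}\Bigr) |A|^{|\Gamma|} \log |\Lambda_n|.$$

Hence, it suffices to show that simultaneously for all neighborhoods $\Gamma \supset \Gamma_0$ with
$r(\Gamma) \le R_n$,

$$\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) - \log \mathrm{PL}_{\Gamma_0}(x(\Lambda_n), Q_{\Gamma_0}) < \frac{|A| - 1}{|A|}\, |A|^{|\Gamma|} \log |\Lambda_n|, \tag{4.8}$$

eventually almost surely as $n \to \infty$.

Now, for $\Gamma \supset \Gamma_0$ we have $\mathrm{PL}_{\Gamma_0}(x(\Lambda_n), Q_{\Gamma_0}) = \mathrm{PL}_\Gamma(x(\Lambda_n), Q_\Gamma)$, by the
definition (2.2) of pseudo-likelihood, since $\Gamma_0$ is a Markov neighborhood. Thus the
left-hand side of (4.8) equals

$$\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) - \log \mathrm{PL}_\Gamma(x(\Lambda_n), Q_\Gamma) = \sum_{a(\Gamma,0) \in x(\Lambda_n)} N_n(a(\Gamma, 0)) \log \frac{N_n(a(\Gamma, 0)) / N_n(a(\Gamma))}{Q(a(0) \mid a(\Gamma))}$$

$$= \sum_{a(\Gamma) \in x(\Lambda_n)} N_n(a(\Gamma)) \sum_{a(0) :\ a(\Gamma,0) \in x(\Lambda_n)} \frac{N_n(a(\Gamma, 0))}{N_n(a(\Gamma))} \log \frac{N_n(a(\Gamma, 0)) / N_n(a(\Gamma))}{Q(a(0) \mid a(\Gamma))}.$$

To bound the last expression, we use Proposition 3.1 and Lemma A.7 in the
Appendix, the latter applied with $P(a(0)) = N_n(a(\Gamma, 0)) / N_n(a(\Gamma))$, $Q(a(0)) = Q(a(0) \mid a(\Gamma))$. Thus
we obtain the upper bound

$$\sum_{a(\Gamma) \in x(\Lambda_n)} N_n(a(\Gamma))\, \frac{1}{q_{\min}} \sum_{a(0) :\ a(\Gamma,0) \in x(\Lambda_n)} \Bigl[ \frac{N_n(a(\Gamma, 0))}{N_n(a(\Gamma))} - Q(a(0) \mid a(\Gamma)) \Bigr]^2$$

$$< \sum_{a(\Gamma) \in x(\Lambda_n)} N_n(a(\Gamma))\, \frac{1}{q_{\min}}\, |A|\, \frac{\kappa \log N_n(a(\Gamma))}{N_n(a(\Gamma))} \le \frac{\kappa |A|}{q_{\min}}\, |A|^{|\Gamma|} \log |\bar\Lambda_n|,$$

eventually almost surely as $n \to \infty$, simultaneously for all neighborhoods $\Gamma \supset \Gamma_0$
with $r(\Gamma) \le R_n$.
Hence, since $|\bar\Lambda_n| / |\Lambda_n| \to 1$, the assertion (4.8) holds whenever

$$\frac{\kappa |A|}{q_{\min}} < \frac{|A| - 1}{|A|},$$

which is equivalent to the bound on $\alpha$ in Proposition 4.1. □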


5. The underestimation. 


PROPOSITION 5.1. Eventually almost surely as $n \to \infty$,

$$\hat\Gamma_{\mathrm{PIC}}(x(\Lambda_n)) \in \{\Gamma : \Gamma \supseteq \Gamma_0\},$$

if $r_n$ in Theorem 2.1 is chosen as in Proposition 4.1.


Proposition 5.1 will be proved using the lemmas below. Let us denote

$$\Psi_0 = \Bigl( \bigcup_{i \in \Gamma_0} \Gamma_0^i \Bigr) \setminus (\Gamma_0 \cup \{0\}).$$


LEMMA 5.1. The assertion of Proposition 3.1 holds also with $\Gamma$ replaced by
$\Gamma \cup \Psi_0$, where $\Gamma$ is any (not necessarily Markov) neighborhood.


PROOF. As Proposition 3.1 was a consequence of Lemma 3.2, we have to
check that the proof of that lemma works when the Markov neighborhood $\Gamma$ is
replaced by $\Gamma \cup \Psi_0$, where $\Gamma$ is any neighborhood. To this end, it suffices to show that
conditional on the values of all random variables in the $(\Gamma \cup \Psi_0)$-neighborhoods
of the sites $i \in \Lambda_n^k$, those $X(i)$, $i \in \Lambda_n^k$, are conditionally i.i.d. for which the same
block $a(\Gamma \cup \Psi_0)$ appears in the $(\Gamma \cup \Psi_0)$-neighborhood of $i$. This follows from
Lemma A.1 in the Appendix, with $\Delta = \Gamma_0 \cup \{0\}$ and $\Psi = \Psi_0$. □


LEMMA 5.2. Simultaneously for all neighborhoods $\Gamma \not\supseteq \Gamma_0$ with $r(\Gamma) \le R_n$,

$$\mathrm{PIC}_{\Gamma \cup \Psi_0}(x(\Lambda_n)) \ge \mathrm{PIC}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}(x(\Lambda_n)),$$

eventually almost surely as $n \to \infty$.


PROOF. The claimed inequality is analogous to (4.7) in the proof of
Proposition 4.1, the role of $\Gamma \supset \Gamma_0$ there played by $\Gamma \cup \Psi_0 \supseteq (\Gamma \cap \Gamma_0) \cup \Psi_0$. Its proof
is the same as that of (4.7), using Lemma 5.1 instead of Proposition 3.1. Indeed,
the basic neighborhood property of $\Gamma_0$ was used in that proof only to show that
$\mathrm{PL}_{\Gamma_0}(x(\Lambda_n), Q_{\Gamma_0}) = \mathrm{PL}_\Gamma(x(\Lambda_n), Q_\Gamma)$. The analogue of this identity, namely

$$\mathrm{PL}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}\bigl(x(\Lambda_n), Q_{(\Gamma \cap \Gamma_0) \cup \Psi_0}\bigr) = \mathrm{PL}_{\Gamma \cup \Psi_0}\bigl(x(\Lambda_n), Q_{\Gamma \cup \Psi_0}\bigr),$$

follows from Lemma A.1 in the Appendix with $\Delta = \Gamma_0 \cup \{0\}$ and $\Psi = \Psi_0$. □


For the next lemma, we introduce some further notation.

The set of all probability distributions on $A^{\mathbb{Z}^d}$, equipped with the weak topology,
is a compact Polish space; let $d$ denote a metric that metrizes it. Let $\mathcal{Q}^G$ denote the
(compact) set of Gibbs distributions with the one-point specification $Q_{\Gamma_0}$.

For a sample $x(\Lambda_n)$, define the empirical distribution on $A^{\mathbb{Z}^d}$ by

$$R_{x,n} = \frac{1}{|\bar\Lambda_n|} \sum_{i \in \bar\Lambda_n} \delta_{x_n^i},$$

where $x_n \in A^{\mathbb{Z}^d}$ is the extension of the sample $x(\Lambda_n)$ to the whole lattice with
$x_n(j)$ equal to a constant $a \in A$ for $j \in \mathbb{Z}^d \setminus \Lambda_n$, and $x_n^i$ denotes the translate of $x_n$
when 0 is translated to $i$ and $\delta_x$ is the Dirac mass at $x \in A^{\mathbb{Z}^d}$.


LEMMA 5.3. With probability 1, $d(R_{x,n}, \mathcal{Q}^G) \to 0$.


PROOF. Fix a realization $x(\mathbb{Z}^d)$ for which Proposition 3.1 holds.
It suffices to show that for any subsequence $n_k$ such that $R_{x,n_k}$ converges, its
limit $R_{x,0}$ belongs to $\mathcal{Q}^G$.
Let $\Gamma'$ be any neighborhood. For $n$ sufficiently large, the $(\Gamma' \cup \{0\})$-marginal
of $R_{x,n}$ is equal to

$$\Bigl\{ \frac{N_n(a(\Gamma', 0))}{|\bar\Lambda_n|},\ a(\Gamma', 0) \in A^{\Gamma' \cup \{0\}} \Bigr\},$$

hence $R_{x,n_k} \to R_{x,0}$ implies

$$\frac{N_{n_k}(a(\Gamma', 0))}{|\bar\Lambda_{n_k}|} \to R_{x,0}(a(\Gamma', 0)) \tag{5.9}$$

for all $a(\Gamma', 0) \in A^{\Gamma' \cup \{0\}}$. This and summation for $a(0) \in A$ imply

$$\frac{N_{n_k}(a(\Gamma', 0))}{N_{n_k}(a(\Gamma'))} \to R_{x,0}(a(0) \mid a(\Gamma')).$$

As Proposition 3.1 holds for the realization $x(\mathbb{Z}^d)$, it follows that if $\Gamma'$ is a Markov
neighborhood, then

$$R_{x,0}(a(0) \mid a(\Gamma')) = Q(a(0) \mid a(\Gamma')) = Q_{\Gamma_0}(a(0) \mid a(\Gamma_0)).$$

For any finite region $\Delta \supseteq \Gamma_0$ with $0 \notin \Delta$, the last equation for a neighborhood
$\Gamma' \supseteq \Delta$ implies that

$$R_{x,0}(a(0) \mid a(\Delta)) = Q_{\Gamma_0}(a(0) \mid a(\Gamma_0)) \qquad \text{if } \Delta \supseteq \Gamma_0,\ 0 \notin \Delta.$$

To prove $R_{x,0} \in \mathcal{Q}^G$ it remains to show that, in addition, $R_{x,0}(a(i) \mid a(\Delta^i)) =
Q_{\Gamma_0}(a(i) \mid a(\Gamma_0^i))$. Actually, we show that $R_{x,0}$ is translation invariant. Indeed,
given a finite region $\Delta \subset \mathbb{Z}^d$ and its translate $\Delta^i$, take a neighborhood $\Gamma'$ with
$\Delta \cup \Delta^i \subseteq \Gamma' \cup \{0\}$, and consider the sum of the counts $N_n(a(\Gamma', 0))$ for all blocks
$a(\Gamma', 0) = \{a(j) : j \in \Gamma' \cup \{0\}\}$ with $\{a(j) : j \in \Delta\}$ equal to a fixed $|\Delta|$-tuple
and the similar sum with $\{a(j) : j \in \Delta^i\}$ equal to the same $|\Delta|$-tuple. If $\|i\| \le
\log^{1/(2d)} |\Lambda_n|$, the difference of these sums is at most $|\Lambda_n| - |\bar\Lambda_n|$, hence the
translation invariance of $R_{x,0}$ follows by (5.9). □


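LEMMA 5.4. Uniformly for all neighborhoods $\Gamma$ not containing $\Gamma_0$,

$$-\log \mathrm{MPL}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}(x(\Lambda_n)) > -\log \mathrm{MPL}_{\Gamma_0}(x(\Lambda_n)) + c\,|\bar\Lambda_n|,$$

eventually almost surely as $n \to \infty$, where $c > 0$ is a constant.


PROOF. Given a realization $x \in A^{\mathbb{Z}^d}$ with the property in Lemma 5.3, there
exists a sequence $Q_{x,n}$ in $\mathcal{Q}^G$ with

$$d(R_{x,n}, Q_{x,n}) \to 0,$$

and consequently

$$\frac{N_n(a(\Delta))}{|\bar\Lambda_n|} - Q_{x,n}(a(\Delta)) \to 0 \tag{5.10}$$

for each finite region $\Delta \subset \mathbb{Z}^d$ and $a(\Delta) \in A^{\Delta}$.
Next, let $\Gamma$ be a neighborhood with $\Gamma \not\supseteq \Gamma_0$. By (2.3),

$$-\frac{1}{|\bar\Lambda_n|} \log \mathrm{MPL}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}(x(\Lambda_n)) = -\frac{1}{|\bar\Lambda_n|} \sum_{a((\Gamma \cap \Gamma_0) \cup \Psi_0,\, 0) \in x(\Lambda_n)} N_n\bigl(a((\Gamma \cap \Gamma_0) \cup \Psi_0, 0)\bigr) \log \frac{N_n\bigl(a((\Gamma \cap \Gamma_0) \cup \Psi_0, 0)\bigr)}{N_n\bigl(a((\Gamma \cap \Gamma_0) \cup \Psi_0)\bigr)}.$$

Applying (5.10) to $\Delta = (\Gamma \cap \Gamma_0) \cup \Psi_0 \cup \{0\}$, it follows that the last expression is
arbitrarily close to

$$-\sum_{a((\Gamma \cap \Gamma_0) \cup \Psi_0 \cup \{0\})} Q_{x,n}\bigl(a((\Gamma \cap \Gamma_0) \cup \Psi_0, 0)\bigr) \log Q_{x,n}\bigl(a(0) \mid a((\Gamma \cap \Gamma_0) \cup \Psi_0)\bigr) = H_{Q_{x,n}}\bigl(X(0) \mid X((\Gamma \cap \Gamma_0) \cup \Psi_0)\bigr)$$

if $n$ is sufficiently large, where $H_{Q_{x,n}}(\cdot \mid \cdot)$ denotes conditional entropy, when the
underlying distribution is $Q_{x,n}$. Similarly, $-(1/|\bar\Lambda_n|) \log \mathrm{MPL}_{\Gamma_0}(x(\Lambda_n))$ is
arbitrarily close to $H_{Q_{x,n}}(X(0) \mid X(\Gamma_0))$, which equals $H_{Q_{x,n}}(X(0) \mid X(\Gamma_0 \cup \Psi_0))$ since
$\Gamma_0$ is a Markov neighborhood.

It is known that $H_{Q'}(X(0) \mid X((\Gamma \cap \Gamma_0) \cup \Psi_0)) \ge H_{Q'}(X(0) \mid X(\Gamma_0 \cup \Psi_0))$ for
any distribution $Q'$. The proof of the lemma will be complete if we show that, in
addition, there exists a constant $\xi > 0$ (depending on $\Gamma \cap \Gamma_0$) such that for every
Gibbs distribution $Q^G \in \mathcal{Q}^G$

$$H_{Q^G}\bigl(X(0) \mid X((\Gamma \cap \Gamma_0) \cup \Psi_0)\bigr) - H_{Q^G}\bigl(X(0) \mid X(\Gamma_0 \cup \Psi_0)\bigr) > \xi.$$

The indirect assumption that the left-hand side goes to 0 for some sequence of
Gibbs distributions in $\mathcal{Q}^G$ implies, using the compactness of $\mathcal{Q}^G$, that

$$H_{Q_0^G}\bigl(X(0) \mid X((\Gamma \cap \Gamma_0) \cup \Psi_0)\bigr) = H_{Q_0^G}\bigl(X(0) \mid X(\Gamma_0 \cup \Psi_0)\bigr),$$

for the limit $Q_0^G \in \mathcal{Q}^G$ of a convergent subsequence. This equality implies

$$Q_0^G\bigl(a(0) \mid a((\Gamma \cap \Gamma_0) \cup \Psi_0)\bigr) = Q_0^G\bigl(a(0) \mid a(\Gamma_0 \cup \Psi_0)\bigr)$$

for all $a(0) \in A$, $a(\Gamma_0 \cup \Psi_0) \in A^{\Gamma_0 \cup \Psi_0}$. By Lemma A.1 in the Appendix, these
conditional probabilities are uniquely determined by the one-point specification $Q_{\Gamma_0}$,
and the last equality implies

$$Q\bigl(a(i) \mid a((\Gamma \cap \Gamma_0)^i \cup \Psi_0^i)\bigr) = Q\bigl(a(i) \mid a(\Gamma_0^i \cup \Psi_0^i)\bigr) = Q_{\Gamma_0}(a(i) \mid a(\Gamma_0^i)).$$

According to Lemma A.2 in the Appendix, this would imply $(\Gamma \cap \Gamma_0) \cup \Psi_0$ is a
Markov neighborhood also, which is a contradiction, as $(\Gamma \cap \Gamma_0) \cup \Psi_0 \not\supseteq \Gamma_0$.

This completes the proof of the lemma because there is only a finite number of
possible intersections $\Gamma \cap \Gamma_0$. □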


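PROOF OF PROPOSITION 5.1. We have to show that

$$\mathrm{PIC}_\Gamma(x(\Lambda_n)) > \mathrm{PIC}_{\Gamma_0}(x(\Lambda_n)), \tag{5.11}$$

eventually almost surely as $n \to \infty$, for all neighborhoods $\Gamma$ with $r(\Gamma) \le R_n$ that
do not contain $\Gamma_0$.

Note that $\Gamma_1 \supseteq \Gamma_2$ implies $\mathrm{MPL}_{\Gamma_1}(x(\Lambda_n)) \ge \mathrm{MPL}_{\Gamma_2}(x(\Lambda_n))$, since
$\mathrm{MPL}_\Gamma(x(\Lambda_n))$ is the maximum over $Q'_\Gamma$ of $\mathrm{PL}_\Gamma(x(\Lambda_n), Q'_\Gamma)$; see (2.2). Hence

$$-\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) \ge -\log \mathrm{MPL}_{\Gamma \cup \Psi_0}(x(\Lambda_n))$$

for any neighborhood $\Gamma$.
Thus

$$\mathrm{PIC}_\Gamma(x(\Lambda_n)) = -\log \mathrm{MPL}_\Gamma(x(\Lambda_n)) + |A|^{|\Gamma|} \log |\Lambda_n| \ge \mathrm{PIC}_{\Gamma \cup \Psi_0}(x(\Lambda_n)) - \bigl(|A|^{|\Gamma \cup \Psi_0|} - |A|^{|\Gamma|}\bigr) \log |\Lambda_n|.$$

Using Lemma 5.2 and the obvious bound $|\Gamma \cup \Psi_0| \le |\Gamma| + |\Psi_0|$, it follows that,
eventually almost surely as $n \to \infty$ for all $\Gamma \not\supseteq \Gamma_0$ with $r(\Gamma) \le R_n$,

$$\mathrm{PIC}_\Gamma(x(\Lambda_n)) \ge \mathrm{PIC}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}(x(\Lambda_n)) - |A|^{|\Gamma|} \bigl(|A|^{|\Psi_0|} - 1\bigr) \log |\Lambda_n|.$$

Here, by Lemma 5.4,

$$\mathrm{PIC}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}(x(\Lambda_n)) > -\log \mathrm{MPL}_{(\Gamma \cap \Gamma_0) \cup \Psi_0}(x(\Lambda_n)) > -\log \mathrm{MPL}_{\Gamma_0}(x(\Lambda_n)) + c\,|\bar\Lambda_n|,$$

eventually almost surely as $n \to \infty$ for all $\Gamma$ as above. This completes the
proof, since the conditions $r(\Gamma) \le R_n$ and $|\bar\Lambda_n| / |\Lambda_n| \to 1$ imply $|A|^{|\Gamma|} \log |\Lambda_n| =
o(|\bar\Lambda_n|)$. □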


6. Discussion. A modification of the Bayesian Information Criterion (BIC)
called PIC has been introduced for estimating the basic neighborhood of a Markov
random field on $\mathbb{Z}^d$, with finite alphabet $A$. In this criterion, the maximum
pseudo-likelihood is used instead of the maximum likelihood, with penalty term
$|A|^{|\Gamma|} \log |\Lambda_n|$ for a candidate neighborhood $\Gamma$, where $\Lambda_n$ is the sample region.
The minimizer of PIC over candidate neighborhoods, with radius allowed to grow
as $o(\log^{1/(2d)} |\bar\Lambda_n|)$, has been proved to equal the basic neighborhood eventually
almost surely, not requiring any prior bound on the size of the latter. This result
is unaffected by phase transition and even by nonstationarity of the joint
distribution. The same result holds if the penalty term is multiplied by any $c > 0$; the no
underestimation part (Proposition 5.1) holds also if $\log |\Lambda_n|$ in the penalty term is
replaced by any function of the sample size $|\Lambda_n|$ that goes to infinity as $o(|\Lambda_n|)$.

PIC estimation of the basic neighborhood of a Markov random field is to a
certain extent similar to BIC estimation of the order of a Markov chain, and of
the context tree of a tree source, also called a variable-length Markov chain. For
context tree estimation via another method see [5, 22], and via BIC, see [9]. There
are, however, also substantial differences. The martingale techniques in [7, 8] do
not appear to carry over to Markov random fields, and the lack of an analogue of
the Krichevsky-Trofimov distribution used in these references is another obstacle.
We also note that the "large" boundaries of multidimensional sample regions cause
side effects not present in the one-dimensional case; to overcome those, we have
defined the pseudo-likelihood function based on a window $\bar\Lambda_n$ slightly smaller than
the whole sample region $\Lambda_n$.

For Markov order and context tree estimation via BIC, consistency has been
proved by Csiszár and Shields [8] admitting, for sample size $n$, all $k < n$ as
candidate orders (see also [7]), respectively by Csiszár and Talata [9] admitting trees
of depth $o(\log n)$ as candidate context trees. In our main result Theorem 2.1, the
PIC estimator of the basic neighborhood is defined admitting candidate
neighborhoods of radius $o(\log^{1/(2d)} |\bar\Lambda_n|)$, thus of size $o(\log^{1/2} |\bar\Lambda_n|)$. The mentioned
one-dimensional results suggest that this bound on the radius might be relaxed to
$o(\log^{1/d} |\bar\Lambda_n|)$, or perhaps dropped completely. This question remains open, even
for the case $d = 1$. A positive answer apparently depends on the possibility of
strengthening our typicality result Proposition 3.1 to similar strength as the
conditional typicality results for Markov chains in [7].

More important than a possible mathematical sharpening of Theorem 2.1, as 
above, would be to find an algorithm to determine the PIC estimator without ac- 
tually computing and comparing the PIC values of all candidate neighborhoods. 
The analogous problem for BIC context tree estimation has been solved: Csiszár 
and Talata [9] showed that this BIC estimator can be computed in linear time via 
an analogue of the "context tree maximizing algorithm" of Willems, Shtarkov and 
Tjalkens [23, 24]. Unfortunately, a similar algorithm for the present problem ap- 
pears elusive, and it remains open whether our estimator can be computed in a 
"clever" way. 

Finally, we emphasize that the goal of this paper was to provide a consistent
estimator of the basic neighborhood of a Markov random field. Of course,
consistency is only one of the desirable properties of an estimator. To assess the practical
performance of this estimator requires further research, such as studying finite
sample size properties, robustness against noisy observations and computability
with acceptable complexity.


Note added in proof. Just before completing the galley proofs, we learned 
that model selection for Markov random fields had been addressed before, by Ji 
and Seymour [18]. They used a criterion almost identical to PIC here and, in a 
somewhat different setting, proved weak consistency under the assumption that 
the number of candidate model classes is finite. 


APPENDIX 


First we indicate how the well-known facts stated in Lemma 2.1 can be formally 
derived from results in [13], using the concepts defined there. 


PROOF OF LEMMA 2.1. By Theorem 1.33 the positive one-point specification
uniquely determines the specification, which is positive and local on account of the
locality of the one-point specification. By Theorem 2.30 this positive local
specification determines a unique "gas" potential (if an element of $A$ is distinguished
as the zero element). Due to Corollary 2.32, this is a nearest-neighbor potential for
a graph with vertex set $\mathbb{Z}^d$ defined there, and $\Gamma_0^i$ is the same as $B(i) \setminus \{i\}$ in that
corollary. □


The following lemma is a consequence of the global Markov property. 


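LEMMA A.1. Let $\Delta \subset \mathbb{Z}^d$ be a finite region with $0 \in \Delta$, and $\Psi =
\bigl(\bigcup_{i \in \Delta} \Gamma_0^i\bigr) \setminus \Delta$. Then for any neighborhood $\Gamma$, the conditional probabilities
$Q(a(i) \mid a(\Gamma^i \cup \Psi^i))$ and $Q(a(i) \mid a((\Gamma^i \cap \Delta^i) \cup \Psi^i))$ are equal and translation
invariant.


PROOF. Since $\Delta$ and $\Psi$ are disjoint, we have

$$Q(a(i) \mid a(\Gamma^i \cup \Psi^i)) = Q\bigl(a(i) \mid a((\Gamma^i \cap \Delta^i) \cup (\Psi^i \cup (\Gamma^i \setminus \Delta^i)))\bigr) = \frac{Q\bigl(a(\{i\} \cup (\Gamma^i \cap \Delta^i)) \mid a(\Psi^i \cup (\Gamma^i \setminus \Delta^i))\bigr)}{Q\bigl(a(\Gamma^i \cap \Delta^i) \mid a(\Psi^i \cup (\Gamma^i \setminus \Delta^i))\bigr)},$$

and similarly

$$Q\bigl(a(i) \mid a((\Gamma^i \cap \Delta^i) \cup \Psi^i)\bigr) = \frac{Q\bigl(a(\{i\} \cup (\Gamma^i \cap \Delta^i)) \mid a(\Psi^i)\bigr)}{Q\bigl(a(\Gamma^i \cap \Delta^i) \mid a(\Psi^i)\bigr)}.$$

By the global Markov property (see Lemma 2.1), both the numerators and
denominators of these two quotients are equal and translation invariant. □


The lemma below follows from the definition of Markov neighborhood.


LEMMA A.2. For a Markov random field with basic neighborhood $\Gamma_0$, if a
neighborhood $\Gamma$ satisfies

$$Q(a(i) \mid a(\Gamma^i)) = Q_{\Gamma_0}(a(i) \mid a(\Gamma_0^i))$$

for all $i \in \mathbb{Z}^d$, then $\Gamma$ is a Markov neighborhood.

PROOF. We have to show that for any $\Delta \supseteq \Gamma$

$$Q(a(i) \mid a(\Delta^i)) = Q(a(i) \mid a(\Gamma^i)). \tag{A.1}$$

Since $\Gamma_0$ is a Markov neighborhood, the condition of the lemma implies

$$Q(a(i) \mid a(\Gamma^i)) = Q(a(i) \mid a(\Gamma_0^i)) = Q\bigl(a(i) \mid a((\Gamma_0 \cup \Delta)^i)\bigr).$$

Hence (A.1) follows, because $\Gamma \subseteq \Delta \subseteq \Gamma_0 \cup \Delta$. □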


Next we state two simple probability bounds. 


LEMMA A.3. Let $Z_1, Z_2, \ldots$ be $\{0, 1\}$-valued random variables such that

\[ \mathrm{Prob}\{Z_j = 1 \mid Z_1,\ldots,Z_{j-1}\} \ge p_* > 0, \qquad j \ge 1, \]

with probability 1. Then for any $0 < \nu < 1$

\[ \mathrm{Prob}\Big\{\frac{1}{n}\sum_{j=1}^{n} Z_j < \nu p_*\Big\} \le e^{-n(p_*/4)(1-\nu)^2}. \]

PROOF. This is a direct consequence of Lemmas 2 and 3 in the Appendix
of [7]. □


LEMMA A.4. Let $Z_1, Z_2, \ldots, Z_n$ be i.i.d. random variables with expectation 0
and variance $D^2$. Then the partial sums

\[ S_k = Z_1 + Z_2 + \cdots + Z_k \]

satisfy

\[ \mathrm{Prob}\Big\{\max_{1\le k\le n} S_k \ge D\sqrt{n}(\mu+2)\Big\} \le \frac{4}{3}\,\mathrm{Prob}\{S_n \ge D\sqrt{n}\mu\}; \]

moreover if the random variables are bounded, $|Z_i| \le K$, then

\[ \mathrm{Prob}\{S_n \ge D\sqrt{n}\mu\} \le \exp\Big[-\frac{\mu^2}{2\big(1+\frac{\mu K}{2D\sqrt{n}}\big)^2}\Big], \]

where $\mu \le D\sqrt{n}/K$.

PROOF. See, for example, Lemma VI.9.1 and Theorem VI.4.1 in [20]. □

The following three lemmas are of a technical nature. 


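LEMMA A.5. For disjoint finite regions $\Phi \subset \mathbb{Z}^d$ and $\Delta \subset \mathbb{Z}^d$, we have

\[ Q(a(\Delta)\mid a(\Phi)) \ge q_{\min}^{|\Delta|}. \]

PROOF. By induction on $|\Delta|$.
For $\Delta = \{i\}$, $\Xi = \Gamma_0^i\setminus\Phi$, we have

\[
Q(a(i)\mid a(\Phi)) = \sum_{a(\Xi)\in A^{\Xi}} Q(a(i)\mid a(\Xi\cup\Phi))\,Q(a(\Xi)\mid a(\Phi))
= \sum_{a(\Xi)\in A^{\Xi}} Q(a(i)\mid a(\Gamma_0^i))\,Q(a(\Xi)\mid a(\Phi)) \ge q_{\min}.
\]

Supposing $Q(a(\Delta)\mid a(\Phi)) \ge q_{\min}^{|\Delta|}$ holds for some $\Delta$, we have for $\{i\}\cup\Delta$, with
$\Xi = \Gamma_0^i\setminus(\Phi\cup\Delta)$,

\[
Q(a(\{i\}\cup\Delta)\mid a(\Phi)) = \sum_{a(\Xi)\in A^{\Xi}} Q(a(\{i\}\cup\Delta\cup\Xi)\mid a(\Phi))
= \sum_{a(\Xi)\in A^{\Xi}} Q(a(i)\mid a(\Delta\cup\Xi\cup\Phi))\,Q(a(\Delta\cup\Xi)\mid a(\Phi)).
\]

Since $Q(a(i)\mid a(\Delta\cup\Xi\cup\Phi)) = Q(a(i)\mid a(\Gamma_0^i)) \ge q_{\min}$, we can continue as

\[ \ge q_{\min}\,Q(a(\Delta)\mid a(\Phi)) \ge q_{\min}^{|\Delta|+1}. \qquad □ \]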


LEMMA A.6. The number of all possible blocks appearing in a site and its
neighborhood with radius not exceeding $R$ can be upper bounded as

\[ \big|\{a(\Gamma,0) \in A^{\Gamma\cup\{0\}} : r(\Gamma) \le R\}\big| \le (|A|^2+1)^{(2R+1)^d/2}. \]

PROOF. The number of the neighborhoods with cardinality $m \ge 1$ and radius
$r(\Gamma) \le R$ is

\[ \binom{((2R+1)^d-1)/2}{m/2}, \]

because the neighborhoods are symmetric. Hence, the number in the proposition
is

\[
|A| + |A|\sum_{m=1}^{((2R+1)^d-1)/2}\binom{((2R+1)^d-1)/2}{m}|A|^{2m}
= |A|\sum_{m=0}^{((2R+1)^d-1)/2}\binom{((2R+1)^d-1)/2}{m}(|A|^2)^m\,1^{((2R+1)^d-1)/2-m}.
\]

Now, using the binomial theorem, the assertion follows. □
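
The last step of this proof is easy to check numerically. The following is a minimal sketch in which the function names (`half_window`, `block_count`, `bound_squared`) are illustrative choices and do not come from the paper. It evaluates the sum from the proof for small $|A|$, $R$ and $d$ and compares it with the bound of the lemma; squares are compared so that the arithmetic stays in exact integers.

```python
from math import comb


def half_window(R: int, d: int) -> int:
    """N = ((2R+1)^d - 1) / 2; (2R+1)^d is odd, so the division is exact."""
    return ((2 * R + 1) ** d - 1) // 2


def block_count(alphabet_size: int, R: int, d: int) -> int:
    """|A| * sum_{m=0}^{N} C(N, m) * |A|^(2m), the sum appearing in the proof."""
    N = half_window(R, d)
    total = sum(comb(N, m) * alphabet_size ** (2 * m) for m in range(N + 1))
    return alphabet_size * total


def bound_squared(alphabet_size: int, R: int, d: int) -> int:
    """Square of the bound (|A|^2 + 1)^((2R+1)^d / 2), kept as an exact integer."""
    return (alphabet_size ** 2 + 1) ** ((2 * R + 1) ** d)


if __name__ == "__main__":
    for A, R, d in [(2, 1, 1), (3, 2, 1), (2, 1, 2)]:
        count = block_count(A, R, d)
        print(A, R, d, count, count ** 2 <= bound_squared(A, R, d))
```

By the binomial theorem the sum equals $|A|(|A|^2+1)^N$ with $N = ((2R+1)^d-1)/2$, and the bound then follows from $|A| \le (|A|^2+1)^{1/2}$.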


LEMMA A.7. Let $P$ and $Q$ be probability distributions on $A$ such that

\[ \max_{a\in A}|P(a)-Q(a)| \le \frac{\min_{a\in A}Q(a)}{2}. \]

Then

\[ \sum_{a\in A}P(a)\log\frac{P(a)}{Q(a)} \le \frac{1}{\min_{a\in A}Q(a)}\sum_{a\in A}(P(a)-Q(a))^2. \]

PROOF. This follows from Lemma 4 in the Appendix of [7]. □


REFERENCES 


[1] AKAIKE, H. (1972). Information theory and an extension of the maximum likelihood prin- 
ciple. In Proc. Second International Symposium on Information Theory. Supplement to 
Problems of Control and Information Theory (B. N. Petrov and F. Csáki, eds.) 267-281. 
Akadémiai Kiadó, Budapest. MR0483125 

[2] AZENCOTT, R. (1987). Image analysis and Markov fields. In Proc. First International Con-
ference on Industrial and Applied Mathematics, Paris (J. McKenna and R. Temam, eds.)
53–61. SIAM, Philadelphia. MR0976851

[3] BESAG, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with dis- 
cussion). J. Roy. Statist. Soc. Ser. B 36 192-236. MR0373208 

[4] BESAG, J. (1975). Statistical analysis of non-lattice data. The Statistician 24 179—195. 

[5] BÜHLMANN, P. and WYNER, A. J. (1999). Variable length Markov chains. Ann. Statist. 27
480–513. MR1714720


[6] COMETS, F. (1992). On consistency of a class of estimators for exponential families of Markov
random fields on the lattice. Ann. Statist. 20 455–468. MR1150354

[7] CSISZÁR, I. (2002). Large-scale typicality of Markov sample paths and consistency of MDL
order estimators. IEEE Trans. Inform. Theory 48 1616–1628. MR1909476

[8] CSISZÁR, I. and SHIELDS, P. C. (2000). The consistency of the BIC Markov order estimator.
Ann. Statist. 28 1601–1619. MR1835033

[9] CSISZÁR, I. and TALATA, ZS. (2006). Context tree estimation for not necessarily finite mem-
ory processes, via BIC and MDL. IEEE Trans. Inform. Theory 52 1007–1016.

[10] DOBRUSHIN, R. L. (1968). The description of a random field by means of conditional proba-
bilities and conditions for its regularity. Theory Probab. Appl. 13 197–224. MR0231434

[11] FINESSO, L. (1992). Estimation of the order of a finite Markov chain. In Recent Advances in
Mathematical Theory of Systems, Control, Networks and Signal Processing 1 (H. Kimura
and S. Kodama, eds.) 643–645. Mita, Tokyo. MR1197985

[12] GEMAN, S. and GRAFFIGNE, C. (1987). Markov random field image models and their ap-
plications to computer vision. In Proc. International Congress of Mathematicians 2
(A. M. Gleason, ed.) 1496–1517. Amer. Math. Soc., Providence, RI. MR0934354

[13] GEORGII, H. O. (1988). Gibbs Measures and Phase Transitions. de Gruyter, Berlin.
MR0956646

[14] GIDAS, B. (1988). Consistency of maximum likelihood and pseudolikelihood estimators for
Gibbs distributions. In Stochastic Differential Systems, Stochastic Control Theory and Ap-
plications (W. Fleming and P.-L. Lions, eds.) 129–145. Springer, New York. MR0934721

[15] GIDAS, B. (1993). Parameter estimation for Gibbs distributions from fully observed data.
In Markov Random Fields: Theory and Application (R. Chellappa and A. Jain, eds.)
471–498. Academic Press, Boston. MR1214376

[16] HANNAN, E. J. and QUINN, B. G. (1979). The determination of the order of an autoregression.
J. Roy. Statist. Soc. Ser. B 41 190–195. MR0547244

[17] HAUGHTON, D. (1988). On the choice of a model to fit data from an exponential family.
Ann. Statist. 16 342–355. MR0924875

[18] JI, C. and SEYMOUR, L. (1996). A consistent model selection procedure for Markov random
fields based on penalized pseudolikelihood. Ann. Appl. Probab. 6 423–443. MR1398052

[19] PICKARD, D. K. (1987). Inference for discrete Markov fields: The simplest non-trivial case.
J. Amer. Statist. Assoc. 82 90–96. MR0883337

[20] RÉNYI, A. (1970). Probability Theory. North-Holland, Amsterdam. MR0315747

[21] SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
MR0468014

[22] WEINBERGER, M. J., RISSANEN, J. and FEDER, M. (1995). A universal finite memory
source. IEEE Trans. Inform. Theory 41 643–652.

[23] WILLEMS, F. M. J., SHTARKOV, Y. M. and TJALKENS, T. J. (1993). The context-tree weight-
ing method: Basic properties. Technical report, Dept. Electrical Engineering, Eindhoven
Univ.

[24] WILLEMS, F. M. J., SHTARKOV, Y. M. and TJALKENS, T. J. (2000). Context-tree maximiz-
ing. In Proc. 2000 Conf. Information Sciences and Systems TP6-7–TP6-12. Princeton,
NJ.


A. RÉNYI INSTITUTE OF MATHEMATICS
HUNGARIAN ACADEMY OF SCIENCES
POB 127, H-1364 BUDAPEST
HUNGARY
E-MAIL: csiszar@renyi.hu
        zstalata@renyi.hu
URL: www.renyi.hu/~csiszar
     www.renyi.hu/~zstalata


The Annals of Statistics 

2006, Vol. 34, No. 1, 146–168

DOI: 10.1214/009053605000000886

© Institute of Mathematical Statistics, 2006 


SPATIAL EXTREMES: MODELS FOR THE STATIONARY CASE 


BY LAURENS DE HAAN¹ AND TERESA T. PEREIRA²
Erasmus University and University of Lisbon 


The aim of this paper is to provide models for spatial extremes in the
case of stationarity. The spatial dependence at extreme levels of a stationary 
process is modeled using an extension of the theory of max-stable processes 
of de Haan and Pickands [Probab. Theory Related Fields 72 (1986) 477—492]. 
We propose three one-dimensional and three two-dimensional models. These 
models depend on just one parameter or a few parameters that measure the 
strength of tail dependence as a function of the distance between locations. 
We also propose two estimators for this parameter and prove consistency un- 
der domain of attraction conditions and asymptotic normality under appro- 
priate extra conditions. 


1. Introduction. The paper develops a framework as well as concrete models 
for statistics of spatial extremes which are sufficiently simple to be used in appli- 
cations. Only the case of stationary processes is considered and the dependence 
structure will be represented by one parameter or a few parameters. Instead of de- 
veloping more complicated models we aim at developing several simple models 
with somewhat different features. 

For simplicity of exposition and in order to stay close to the existing literature
we shall start discussing processes which are defined on $\mathbb{R}$ rather than $\mathbb{R}^2$. After
that we discuss stationary processes in $\mathbb{R}^2$.

The setting is as follows. Consider independent replications of a stochastic
process with continuous sample paths

\[ \{X_n(t)\}_{t\in\mathbb{R}}, \]

$n = 1, 2, \ldots$. Suppose that the process is in the domain of attraction of a max-
stable process, that is, there are sequences of continuous functions $a_n > 0$ and $b_n$
such that as $n \to \infty$

\[ \Big\{\max_{1\le i\le n}\frac{X_i(t)-b_n(t)}{a_n(t)}\Big\}_{t\in\mathbb{R}} \to \{\tilde Z(t)\}_{t\in\mathbb{R}} \tag{1.1} \]

Received November 2002; revised March 2005 

1 Supported in part by Gulbenkian Foundation. 

2 Supported by FCT/POCTI/35163/MAT/2000/FEDER (Project MEDE: Statistical Modeling of
Spatial Data).

AMS 2000 subject classifications. Primary 60G70, 62H11, 62G32; secondary 62E20, 60G10,
62M40.

Key words and phrases. Extreme-value theory, spatial extremes, spatial tail dependence, max-
stable processes, multivariate extremes, semiparametric estimation.


in $C$-space. Necessary and sufficient conditions have been given by de Haan and
Lin [4]. The limit process $\{\tilde Z(t)\}$ is a max-stable process. Without loss of generality
we can assume that the marginal distributions of $\tilde Z$ can be written as

\[ \exp\{-(1+\gamma(t)x)^{-1/\gamma(t)}\} \]

for all $x$ with $1+\gamma(t)x > 0$ where the function $\gamma$ is continuous. For the time being
we shall discuss the standardized process, called simple max-stable,

\[ \{Z(t)\}_{t\in\mathbb{R}} := \big\{(1+\gamma(t)\tilde Z(t))_+^{1/\gamma(t)}\big\}_{t\in\mathbb{R}}, \]

whose marginal distribution functions are all Fréchet: $\exp(-1/x)$, $x > 0$.

$\{Z(t)\}$ is a simple max-stable process. We assume that $\{Z(t)\}$ is a stationary
process. The theory of de Haan and Pickands [5] applies. According to Theorem 6.1
of that paper the process is determined by a nonnegative $L_1$ function and a
group of linear $L_1^+$ isometries. However, since we aim at manageable models, we
shall restrict ourselves to the subclass of stationary max-stable processes which is
discussed on pages 490–491 of [5], the one of moving maximum processes. The
process is defined as follows.

Let $\phi$ be a unimodal continuous probability density on the real line and let
$\{X_j, Y_j\}_{j\ge1}$ be the points of a homogeneous Poisson process on $\mathbb{R}\times\mathbb{R}_+$. The
process is defined as a functional of the Poisson process as follows:

\[ Z(t) := \max_{j\ge1}\frac{\phi(X_j-t)}{Y_j} \qquad\text{for } t\in\mathbb{R}. \]

It is easy to check directly that this process is stationary and simple max-
stable. The almost sure continuity of the process follows from [1]. We think of $t$
as a space parameter, not a time parameter. Note that for $t_1, t_2, \ldots, t_d \in \mathbb{R}$ and
$x_1, x_2, \ldots, x_d > 0$ (cf. [5])

\[ P\{Z(t_1)\le x_1,\ldots,Z(t_d)\le x_d\} = \exp\Big\{-\int_{-\infty}^{+\infty}\max_{1\le i\le d}\frac{\phi(s-t_i)}{x_i}\,ds\Big\}. \]
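
The definition above translates directly into a simulation recipe. The sketch below is only an illustration under stated assumptions: it uses the normal density with parameter $\beta$ for $\phi$, restricts the $X_j$ to a finite window (the `half_width` argument, an assumption that discards points where $\phi$ is negligible), and generates the $Y_j$ in increasing order as cumulative sums of exponential variables divided by the window length. The function names are ours. Because the $Y_j$ increase, the loop can stop exactly as soon as no later point can raise either maximum.

```python
import math
import random


def normal_density(x: float, beta: float) -> float:
    """phi(x) = beta / sqrt(2 pi) * exp(-beta^2 x^2 / 2)."""
    return beta / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (beta * x) ** 2)


def simulate_pair(t: float, beta: float, rng: random.Random,
                  half_width: float = 8.0) -> tuple:
    """One draw of (Z(0), Z(t)), t >= 0, for the moving maximum process.

    Points (X_j, Y_j) of a unit-rate Poisson process on [lo, hi] x (0, inf)
    are generated with Y_1 < Y_2 < ... ; X_j is uniform on the window.
    """
    lo, hi = -half_width, t + half_width
    length = hi - lo
    phi_max = normal_density(0.0, beta)
    y = 0.0
    z0 = zt = 0.0
    while True:
        y += rng.expovariate(1.0) / length      # next Y_j in increasing order
        if phi_max / y <= min(z0, zt):          # later points cannot matter
            return z0, zt
        x = rng.uniform(lo, hi)
        z0 = max(z0, normal_density(x, beta) / y)
        zt = max(zt, normal_density(x - t, beta) / y)


if __name__ == "__main__":
    rng = random.Random(2006)
    draws = [simulate_pair(1.0, 1.0, rng) for _ in range(5000)]
    share = sum(z0 <= 1.0 for z0, _ in draws) / len(draws)
    # The Frechet marginal gives P{Z(0) <= 1} = exp(-1).
    print("empirical P{Z(0) <= 1}:", share, " target:", math.exp(-1.0))
```

The empirical frequencies should be close to the Fréchet value $\exp(-1/x)$ for each marginal, up to Monte Carlo error.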

However, this is not yet sufficiently simple for applications. We shall consider
three specific examples depending on just one parameter. For $\phi$ we choose the
normal density

\[ \frac{\beta}{\sqrt{2\pi}}\exp\Big\{-\frac{\beta^2x^2}{2}\Big\}, \tag{1.2} \]

the double exponential density

\[ \frac{\beta}{2}\exp\{-\beta|x|\} \tag{1.3} \]

and the $t$-density

\[ \frac{\beta\,\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\Big(1+\frac{\beta^2x^2}{\nu}\Big)^{-(\nu+1)/2}, \qquad\text{with $\nu$ a positive integer,} \tag{1.4} \]

where $\beta$ is a positive constant. The constant $\beta$ measures the strength of tail depen-
dence. We shall see that in all cases small values of $\beta$ point at strong dependence
and large values of $\beta$ point at weak dependence. However, the dependence does
not decrease at the same rate in all cases.

As it should be for spatial models, the tail dependence between $Z(0)$ and $Z(t)$
decreases monotonically and continuously with $|t|$. In particular, when $|t| \to \infty$,
the random variables $Z(0)$ and $Z(t)$ become independent.

The same happens for fixed $t$ when varying $\beta$: as $\beta \downarrow 0$ the process becomes
a.s. constant and as $\beta \to \infty$ $Z(0)$ and $Z(t)$ become independent (cf. propositions
below). The dependence decreases monotonically as $\beta$ increases.

Next we extend the definition to processes on $\mathbb{R}^2$, that is, to random fields.
It is readily seen that the theory of de Haan and Pickands [5] remains valid if
the underlying Poisson process is based on $\mathbb{R}^2\times\mathbb{R}_+$ rather than $\mathbb{R}\times\mathbb{R}_+$. So we
consider a unimodal (i.e., nonincreasing in each direction starting from the mode)
continuous probability density $\phi$ on $\mathbb{R}^2$. Let $\{X_j, W_j, Y_j\}_{j\ge1}$ be the points of a
homogeneous Poisson point process on $\mathbb{R}^2\times\mathbb{R}_+$. The process defined by

\[ Z(t_1,t_2) := \max_{j\ge1}\frac{\phi(X_j-t_1, W_j-t_2)}{Y_j} \qquad\text{for } (t_1,t_2)\in\mathbb{R}^2 \]

is easily seen to be stationary and simple max-stable. The a.s. continuity of the
process follows from an extension of the arguments in [1].
The specific models we consider are analogous to the ones in the one-
dimensional situation:

\[ \phi(t_1,t_2) = \frac{\beta^2}{2\pi}\exp\Big\{-\frac{\beta^2(t_1^2+t_2^2)}{2}\Big\} \tag{1.5} \]

(we call this the normal model),

\[ \phi(t_1,t_2) = \frac{\beta^2}{4}\exp\{-\beta(|t_1|+|t_2|)\} \tag{1.6} \]

(we call this the exponential model) and

\[ \phi(t_1,t_2) = \frac{\beta^2}{2\pi}\Big(1+\frac{\beta^2(t_1^2+t_2^2)}{2(\alpha-1)}\Big)^{-\alpha}, \qquad \alpha>1, \tag{1.7} \]

(we call this the $t$-model), for $\beta > 0$. Finally we consider the general normal model

\[ \phi(t_1,t_2) = \frac{\beta_1\beta_2}{2\pi\sqrt{1-\rho^2}}\exp\Big\{-\frac{1}{2(1-\rho^2)}\big[\beta_1^2t_1^2-2\rho\beta_1\beta_2t_1t_2+\beta_2^2t_2^2\big]\Big\}, \tag{1.8} \]

where $\rho$ is the correlation coefficient ($-1 < \rho < 1$), for $\beta_1, \beta_2 > 0$.

The paper is organized as follows. The two-dimensional distributions of the 
process are derived for the mentioned models in Section 2. Those are sufficient 
for the estimation theory developed in Section 3. The two-dimensional marginal 




distribution for the normal model was derived earlier by Smith in an unpublished 
paper [14]. Higher-dimensional marginal distributions do not seem easy to calcu- 
late explicitly. 

In Section 3 we are mainly concerned with estimating the dependence parameter
$\beta$ on the basis of observations at finitely many locations from a stationary stochas-
tic process in the domain of attraction of the max-stable process. In all the models
there is a simple relation between $\beta$ and a well-known dependence coefficient for
two-dimensional extremes,

\[ \lambda = \lim_{t\downarrow0}t^{-1}P\{1-F_1(X_1)\le t \text{ and } 1-F_2(X_2)\le t\}, \]

where $(X_1, X_2)$ has a distribution function $F$ which is in the domain of attrac-
tion of some extreme-value distribution ($F_1$ and $F_2$ are the marginal distribution
functions).

The coefficient $\lambda \in [0, 1]$ is related to the general framework of multidimen-
sional extremes in the following way. If

\[ \lim_{n\to\infty}F^n(a_nx+b_n, c_ny+d_n) = G(x,y), \]

where the two marginals of $G$ are of the form $\exp(-(1+\gamma_ix)^{-1/\gamma_i})$, $i = 1, 2$, then

\[
\lim_{t\downarrow0}t^{-1}P\{1-F_1(X_1)\le tx \text{ or } 1-F_2(X_2)\le ty\}
= -\log G\Big(\frac{x^{-\gamma_1}-1}{\gamma_1},\frac{y^{-\gamma_2}-1}{\gamma_2}\Big) =: L(x,y)
\]

and

\[
\lim_{t\downarrow0}t^{-1}P\{1-F_1(X_1)\le tx \text{ and } 1-F_2(X_2)\le ty\} = x+y-L(x,y) =: R(x,y).
\]

Then $\lambda = R(1, 1)$.


D. Mason and X. Huang (see [9]) proved consistency and asymptotic normality
for a natural estimator $\hat R(x, y)$ of $R(x, y)$. We use this for estimating $\lambda = R(1, 1)$.
This result leads to a consistent asymptotically normal estimator of $\beta$ based on
observations taken at just two sites, $t_1$ and $t_2$. In general observations are available
at sites $t_1, t_2, \ldots, t_d$ (i.e., finitely many). One of our estimators of $\beta$ is the average
of $\beta$-estimators based on the various pairs of sites. Consistency and asymptotic
normality follow.

The theory we develop can be used to solve some common problems in spatial
extremes. One is an extrapolation problem: it consists of estimating the extremal
behavior of a process $X(t)$ at a site $t_0$ where no observations are available based on
repeated observations of the process at $d$ different space points $t_1, t_2, \ldots, t_d$. An-
other problem refers to the extreme values of the (unobserved) aggregate process
$\int_S X(t)\,dt$ over a space region $S$ assuming that the process has been discretely ob-
served at a number of space points in $S$. Similarly one can look at the tail behavior
of $\sup_{t\in S}X(t)$.

In order to attack those problems, the estimation of $\beta$ has to be complemented
by estimation of local parameters: the extreme-value index, the scale and the lo-
cation. The latter objects are not needed for the estimation of $\beta$ but only for the
application.

The proposed models are quite simple examples of the representation in [5],
which is valid for all stationary max-stable processes. It seems that the full model
is not easily applicable: how does one estimate the initial nonnegative $L_1$ function
and a general group of transformations? Moreover the representation is not unique.
Instead we have tried to look at simpler models which can be analyzed mathemat-
ically. We hope that providing several simple models with quite different features
will widen the scope for applications. Later we want to consider somewhat more
general parametric groups of transformations.

The validity of the model in applications can be checked with the following
steps: estimate $\beta$ as outlined in Section 3, estimate $R(1, 1)$ for each pair of sites
using the relation in Corollary 3.2 and check if this estimate of $R(1, 1)$ is similar
to the direct estimate using Proposition 3.2. We have not done this yet.


2. Marginal distributions for the models in $\mathbb{R}$ and $\mathbb{R}^2$. We find the
two-dimensional marginal distributions for all the models introduced in Section 1.
Note that in this section, different from other sections, for simplicity and without
loss of generality we consider the standardized process $Z$, not $\tilde Z$. First we consider
the processes on the line, next the ones on $\mathbb{R}^2$.


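PROPOSITION 2.1. For $t \in \mathbb{R}$ and the exponential model, for $w_1, w_2 > 0$,

\[ -\log P\{Z(0)\le w_1, Z(t)\le w_2\} \]

\[ = \frac{\beta}{2}\int_{-\infty}^{\infty}\max\Big\{\frac{e^{-\beta|s|}}{w_1},\frac{e^{-\beta|s-t|}}{w_2}\Big\}\,ds \tag{2.1} \]

\[
= \begin{cases}
\dfrac{1}{w_2}, & \text{for } 0 < w_2 \le e^{-\beta|t|}w_1,\\[1ex]
\dfrac{1}{w_1}+\dfrac{1}{w_2}-\dfrac{e^{-\beta|t|/2}}{\sqrt{w_1w_2}}, & \text{for } e^{-\beta|t|}w_1 < w_2 < e^{\beta|t|}w_1,\\[1ex]
\dfrac{1}{w_1}, & \text{for } w_2 \ge e^{\beta|t|}w_1,
\end{cases}
\tag{2.2}
\]

\[
= \int_{\arctan(e^{-\beta|t|})}^{\arctan(e^{\beta|t|})}\max\Big\{\frac{\cos\theta}{w_1},\frac{\sin\theta}{w_2}\Big\}s(\theta)\,d\theta
+\frac12\max\Big\{\frac{1}{w_1},\frac{e^{-\beta|t|}}{w_2}\Big\}
+\frac12\max\Big\{\frac{e^{-\beta|t|}}{w_1},\frac{1}{w_2}\Big\},
\tag{2.3}
\]

where the spectral density $s$ is given by

\[ s(\theta) = \frac{e^{-\beta|t|/2}}{4}(\sin\theta\cos\theta)^{-3/2}; \]

\[ = \frac{1}{w_1}+\frac{1}{w_2}+\frac{1}{w_2}\chi\Big(\frac{w_2}{w_1}\Big), \tag{2.4} \]

and where the dependence function $\chi$ is given by

\[
\chi(s) = \begin{cases}
-s, & 0 < s \le e^{-\beta|t|},\\
-e^{-\beta|t|/2}\sqrt{s}, & e^{-\beta|t|} < s < e^{\beta|t|},\\
-1, & s \ge e^{\beta|t|}.
\end{cases}
\]

REMARK 2.1. Formula (2.1) follows directly from [5] and (2.3) reveals the
spectral measure, which has a density on the interval $(\arctan(e^{-\beta|t|}), \arctan(e^{\beta|t|}))$
and atoms of size $\sqrt{1+e^{-2\beta|t|}}/2$ at each of the two boundary points of that
interval. The last characterization (2.4) is in the spirit of Sibuya [13] and
Pickands [11].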


PROOF OF PROPOSITION 2.1. The integrand of (2.1) is $e^{-\beta|s-t|}/w_2$ if
$e^{-\beta|s-t|}/w_2 \ge e^{-\beta|s|}/w_1$, that is, if

\[ |s-t|-|s| \le \frac{1}{\beta}\log\Big(\frac{w_1}{w_2}\Big). \tag{2.5} \]

Since the joint distributions of $(Z(0), Z(t))$ and $(Z(0), Z(-t))$ are the same we
proceed as if $t$ is positive. The left-hand side is $t$ for $s < 0$, $t - 2s$ for $0 \le s \le t$ and
$-t$ for $s > t$. Hence if $\frac{1}{\beta}\log(\frac{w_1}{w_2}) \ge t$, then inequality (2.5) holds for all $s$ and we
get the first line of (2.2). Similarly, if $\frac{1}{\beta}\log(\frac{w_1}{w_2}) \le -t$, we get the last line of (2.2).
Next suppose $-t < \frac{1}{\beta}\log(\frac{w_1}{w_2}) < t$. In this case (2.5) becomes

\[ t-2s \le \frac{1}{\beta}\log\Big(\frac{w_1}{w_2}\Big), \]

that is,

\[ s \ge \frac{t}{2}-\frac{1}{2\beta}\log\Big(\frac{w_1}{w_2}\Big). \]

Hence the integral over this interval becomes

\[
\frac{\beta}{2}\int_{t/2-(1/(2\beta))\log(w_1/w_2)}^{\infty}\frac{e^{-\beta|s-t|}}{w_2}\,ds
= \frac{1}{w_2}-\frac{\beta}{2}\int_{-\infty}^{t/2-(1/(2\beta))\log(w_1/w_2)}\frac{e^{\beta(s-t)}}{w_2}\,ds
\]
\[
= \frac{1}{w_2}-\frac{e^{-\beta t}}{2w_2}\,e^{\beta t/2-(1/2)\log(w_1/w_2)}
= \frac{1}{w_2}-\frac{1}{2}\frac{e^{-\beta t/2}}{\sqrt{w_1w_2}}.
\]

This in combination with the integral stemming from the case $e^{-\beta|s-t|}/w_2 < e^{-\beta|s|}/w_1$ gives
the second line of (2.2).

To check the equivalence of (2.2) and (2.3) for $e^{-\beta|t|} < w_2/w_1 < e^{\beta|t|}$ it suf-
fices to see that the density of (2.2), after transformation to the polar coordinates
$r = \sqrt{w_1^2+w_2^2}$ and $\theta = \arctan(w_2/w_1)$, is $r^{-3}s(\theta)$ (cf. the construction of the spectral
measure in [6]). For $(w_1, w_2)$ outside this range just evaluate the integral in (2.3). □
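
The three expressions (2.1), (2.2) and (2.3) can be compared numerically. The sketch below is a plain midpoint-rule check; the function names, the truncation of the real line to a wide window and the number of grid points are assumptions made for illustration only.

```python
import math


def v_closed(w1: float, w2: float, t: float, beta: float) -> float:
    """Closed form (2.2) for -log P{Z(0) <= w1, Z(t) <= w2}."""
    a = math.exp(-beta * abs(t))
    if w2 <= a * w1:
        return 1.0 / w2
    if w2 >= w1 / a:
        return 1.0 / w1
    return 1.0 / w1 + 1.0 / w2 - math.exp(-beta * abs(t) / 2.0) / math.sqrt(w1 * w2)


def v_integral(w1: float, w2: float, t: float, beta: float, n: int = 100_000) -> float:
    """Midpoint rule for the integral (2.1), truncated to a wide window."""
    lo = min(0.0, t) - 40.0 / beta
    hi = max(0.0, t) + 40.0 / beta
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        s = lo + (i + 0.5) * h
        total += max(math.exp(-beta * abs(s)) / w1,
                     math.exp(-beta * abs(s - t)) / w2)
    return 0.5 * beta * h * total


def v_spectral(w1: float, w2: float, t: float, beta: float, n: int = 100_000) -> float:
    """Spectral form (2.3): density part by the midpoint rule plus the two atoms."""
    a = math.exp(-beta * abs(t))
    lo, hi = math.atan(a), math.atan(1.0 / a)
    h = (hi - lo) / n
    c = math.exp(-beta * abs(t) / 2.0) / 4.0
    total = 0.0
    for i in range(n):
        th = lo + (i + 0.5) * h
        s = c * (math.sin(th) * math.cos(th)) ** -1.5
        total += max(math.cos(th) / w1, math.sin(th) / w2) * s
    atoms = 0.5 * max(1.0 / w1, a / w2) + 0.5 * max(a / w1, 1.0 / w2)
    return h * total + atoms


if __name__ == "__main__":
    for args in [(1.0, 1.0, 1.0, 1.0), (1.0, 2.0, 0.7, 1.5), (1.0, 5.0, 0.5, 1.0)]:
        print(args, v_closed(*args), v_integral(*args), v_spectral(*args))
```

For $w_1 = w_2 = 1$ all three reduce to $2 - e^{-\beta|t|/2}$, which is a convenient hand check.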


REMARK 2.2. The parameter $\beta$ controls the dependence: if $\beta \to \infty$, the spec-
tral density $s(\theta)$ goes to zero and the spectral measure is concentrated at the points
$\{0, \pi/2\}$. This means that $X(0)$ and $X(t)$ are independent. If $\beta \downarrow 0$, the spectral mea-
sure concentrates on $\{\pi/4\}$. This means that $X(0) = X(t)$ a.s.


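PROPOSITION 2.2. For $t \in \mathbb{R}$ and the normal model, for $w_1, w_2 > 0$,

\[ -\log P\{Z(0)\le w_1, Z(t)\le w_2\} \]

\[ = \frac{1}{w_1}\Phi\Big(\frac{\beta|t|}{2}+\frac{1}{\beta|t|}\log\frac{w_2}{w_1}\Big)+\frac{1}{w_2}\Phi\Big(\frac{\beta|t|}{2}+\frac{1}{\beta|t|}\log\frac{w_1}{w_2}\Big) \tag{2.6} \]

\[ = \int_0^{\pi/2}\max\Big\{\frac{\cos\theta}{w_1},\frac{\sin\theta}{w_2}\Big\}s(\theta)\,d\theta, \tag{2.7} \]

where the spectral density $s$ is given by

\[
s(\theta) = \frac{1}{\beta|t|\sin\theta\cos^2\theta}\Big(\frac12-\frac{1}{\beta^2t^2}\ln(\tan\theta)\Big)\varphi\Big(\frac{\beta|t|}{2}+\frac{1}{\beta|t|}\ln(\tan\theta)\Big)
+\frac{1}{\beta|t|\sin^2\theta\cos\theta}\Big(\frac12+\frac{1}{\beta^2t^2}\ln(\tan\theta)\Big)\varphi\Big(\frac{\beta|t|}{2}-\frac{1}{\beta|t|}\ln(\tan\theta)\Big)
\]

with $\varphi(u) = \Phi'(u)$;

\[ = \frac{1}{w_1}+\frac{1}{w_2}+\frac{1}{w_2}\chi\Big(\frac{w_2}{w_1}\Big), \tag{2.8} \]

where the dependence function $\chi$ is given by

\[ \chi(s) = -s\,\Phi\Big(-\frac{\beta|t|}{2}-\frac{1}{\beta|t|}\ln s\Big)-\Phi\Big(-\frac{\beta|t|}{2}+\frac{1}{\beta|t|}\ln s\Big), \qquad s > 0. \]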


REMARK 2.3. Again the parameter $\beta$ controls the dependence: if $\beta \to \infty$,
the variables $Z(0)$ and $Z(t)$ become independent; if $\beta \downarrow 0$, we get $Z(0) = Z(t)$
a.s.

REMARK 2.4. This distribution function has been obtained in a number of
ways in the literature. Eddy [7] found it when studying convex hulls of samples.
Hüsler and Reiss [10] obtained the distribution as the limit distribution of the com-
ponentwise maxima in a triangular array where the distribution of the $n$th array is
the two-dimensional normal distribution with correlation coefficient $\rho(n)$ such that
$\lim_{n\to\infty}(1-\rho(n))\log n$ exists. A related reference is [2]. In [14] Smith developed
the distribution in the same way as it is done here. The distribution is mentioned in
[3] and [12]. Falk, Hüsler and Reiss [8] obtained the distribution as the pointwise
maximum of independent Brownian motions shifted by an amount corresponding
to points of a Poisson point process. Another way of obtaining the distribution is

\[
-\log P\{Z(0)\le w_1, Z(t)\le w_2\} = E\max\Big\{\frac{1}{w_1},\frac{1}{w_2}\exp(N\beta t-\beta^2t^2/2)\Big\}, \qquad t\in\mathbb{R},
\]

where $N$ is a standard normal random variable.


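PROOF OF PROPOSITION 2.2. Clearly the distribution depends only on $|t|$.
So we consider $t > 0$ only:

\[
-\log P\{Z(0)\le w_1, Z(t)\le w_2\} = \frac{\beta}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\max\Big\{\frac{e^{-\beta^2u^2/2}}{w_1},\frac{e^{-\beta^2(u-t)^2/2}}{w_2}\Big\}\,du.
\]

Now

\[ \frac{e^{-\beta^2u^2/2}}{w_1} \ge \frac{e^{-\beta^2(u-t)^2/2}}{w_2} \quad\text{if and only if}\quad \beta u \le \frac{\beta t}{2}+\frac{1}{\beta t}\ln\frac{w_2}{w_1}, \]

hence

\[
-\log P\{Z(0)\le w_1, Z(t)\le w_2\}
= \frac{1}{w_1}\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\beta t/2+(1/\beta t)\ln(w_2/w_1)}e^{-u^2/2}\,du
+\frac{1}{w_2}\frac{1}{\sqrt{2\pi}}\int_{\beta t/2+(1/\beta t)\ln(w_2/w_1)}^{\infty}e^{-(u-\beta t)^2/2}\,du
\]
\[
= \frac{1}{w_1}\Phi\Big(\frac{\beta t}{2}+\frac{1}{\beta t}\ln\frac{w_2}{w_1}\Big)+\frac{1}{w_2}\Phi\Big(\frac{\beta t}{2}+\frac{1}{\beta t}\ln\frac{w_1}{w_2}\Big).
\]

Hence the first part of the result. In order to obtain the second result note that

\[
r^{-3}s(\theta) = \frac{\partial^2}{\partial w_1\,\partial w_2}\log P\{Z(0)\le w_1, Z(t)\le w_2\}\Big|_{w_1=r\cos\theta,\;w_2=r\sin\theta}
\]

(cf. the construction of the spectral measure in [6]). □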
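
As a sanity check on Proposition 2.2 and Remark 2.4, the sketch below compares four numerical evaluations of the same quantity: the closed form (2.6), the defining integral with the normal density (1.2), the spectral integral (2.7), and the expectation representation of Remark 2.4. It is an illustration only; the helper names, integration windows and grid size are assumptions, and the spectral density is coded in the two-term form stated in the proposition.

```python
import math


def Phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def phi(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def midpoint(f, lo: float, hi: float, n: int = 100_000) -> float:
    """Composite midpoint rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def v_closed(w1, w2, t, beta):
    """Closed form (2.6)."""
    a, b = beta * abs(t) / 2.0, 1.0 / (beta * abs(t))
    return Phi(a + b * math.log(w2 / w1)) / w1 + Phi(a + b * math.log(w1 / w2)) / w2


def v_integral(w1, w2, t, beta):
    """Integral of max{phi(u)/w1, phi(u - t)/w2} with the normal density (1.2)."""
    def f(u):
        return max(math.exp(-0.5 * (beta * u) ** 2) / w1,
                   math.exp(-0.5 * (beta * (u - t)) ** 2) / w2)
    lo, hi = min(0.0, t) - 10.0 / beta, max(0.0, t) + 10.0 / beta
    return beta / math.sqrt(2.0 * math.pi) * midpoint(f, lo, hi)


def v_lognormal(w1, w2, t, beta):
    """E max{1/w1, exp(N*beta*t - beta^2 t^2 / 2)/w2} with N standard normal."""
    bt = beta * abs(t)
    def f(z):
        return max(1.0 / w1, math.exp(z * bt - 0.5 * bt * bt) / w2) * phi(z)
    return midpoint(f, -12.0, 12.0 + bt)


def spectral_density(theta, t, beta):
    """Two-term spectral density s(theta) of Proposition 2.2."""
    a, b = beta * abs(t) / 2.0, 1.0 / (beta * abs(t))
    x = math.log(math.tan(theta))
    s, c = math.sin(theta), math.cos(theta)
    return (b / (s * c * c) * (0.5 - b * b * x) * phi(a + b * x)
            + b / (s * s * c) * (0.5 + b * b * x) * phi(a - b * x))


def v_spectral(w1, w2, t, beta):
    """Spectral integral (2.7) over (0, pi/2)."""
    def f(theta):
        return max(math.cos(theta) / w1, math.sin(theta) / w2) * spectral_density(theta, t, beta)
    return midpoint(f, 0.0, math.pi / 2.0)


if __name__ == "__main__":
    args = (1.0, 2.5, 0.8, 1.5)
    print(v_closed(*args), v_integral(*args), v_lognormal(*args), v_spectral(*args))
```

For $w_1 = w_2 = 1$ every version reduces to $2\Phi(\beta|t|/2)$.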


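PROPOSITION 2.3. For $t \in \mathbb{R}$ and the $t$-model, for $w_1, w_2 > 0$,

\[
-\log P\{Z(0)\le w_1, Z(t)\le w_2\}
= \begin{cases}
\dfrac{1}{w_2}, & 0 < w_2 \le b_{2,\nu}^{-(\nu+1)/2}w_1,\\[1ex]
\dfrac{1}{w_1}p_{1,\nu}(\beta,t,x)+\dfrac{1}{w_2}\big(1-p_{2,\nu}(\beta,t,x)\big), & b_{2,\nu}^{-(\nu+1)/2}w_1 < w_2 < w_1,\\[1ex]
\dfrac{2}{w}P\{T_\nu \le \beta|t|/2\}, & w_1 = w_2 =: w,\\[1ex]
\dfrac{1}{w_1}\big(1-p_{1,\nu}(\beta,t,x)\big)+\dfrac{1}{w_2}p_{2,\nu}(\beta,t,x), & w_1 < w_2 < b_{1,\nu}^{-(\nu+1)/2}w_1,\\[1ex]
\dfrac{1}{w_1}, & w_2 \ge b_{1,\nu}^{-(\nu+1)/2}w_1,
\end{cases}
\tag{2.9}
\]

where

\[
b_{1,\nu} = 1+\frac{\beta^2t^2}{2\nu}-\frac{\beta|t|}{\sqrt{\nu}}\sqrt{1+\frac{\beta^2t^2}{4\nu}}, \qquad
b_{2,\nu} = 1+\frac{\beta^2t^2}{2\nu}+\frac{\beta|t|}{\sqrt{\nu}}\sqrt{1+\frac{\beta^2t^2}{4\nu}},
\]

\[
p_{1,\nu}(\beta,t,x) = P\Big\{\Big|T_\nu-\frac{\beta t}{1-x}\Big| \le \sqrt{\frac{\beta^2t^2x}{(1-x)^2}-\nu}\Big\}, \qquad
p_{2,\nu}(\beta,t,x) = P\Big\{\Big|T_\nu-\frac{\beta tx}{1-x}\Big| \le \sqrt{\frac{\beta^2t^2x}{(1-x)^2}-\nu}\Big\},
\]

$T_\nu$ is a random variable with a Student $t$-distribution with $\nu$ degrees of freedom
and scale parameter 1 and $x = (w_1/w_2)^{2/(\nu+1)}$;

\[ = \frac{1}{w_1}+\frac{1}{w_2}+\frac{1}{w_2}\chi\Big(\frac{w_2}{w_1}\Big), \tag{2.10} \]

where the dependence function $\chi$ is given by

\[
\chi(s) = \begin{cases}
-s, & 0 < s \le b_{2,\nu}^{-(\nu+1)/2},\\
s\,p_{1,\nu}(\beta,t,s^{-2/(\nu+1)})-s-p_{2,\nu}(\beta,t,s^{-2/(\nu+1)}), & b_{2,\nu}^{-(\nu+1)/2} < s < 1,\\
-2P\{T_\nu > \beta|t|/2\}, & s = 1,\\
-s\,p_{1,\nu}(\beta,t,s^{-2/(\nu+1)})-1+p_{2,\nu}(\beta,t,s^{-2/(\nu+1)}), & 1 < s < b_{1,\nu}^{-(\nu+1)/2},\\
-1, & s \ge b_{1,\nu}^{-(\nu+1)/2}.
\end{cases}
\]

REMARK 2.5. The spectral measure corresponding to (2.9) is concentrated
on $[\arctan(b_{2,\nu}^{-(\nu+1)/2}), \arctan(b_{1,\nu}^{-(\nu+1)/2})]$, having a density on
$(\arctan(b_{2,\nu}^{-(\nu+1)/2}), \arctan(b_{1,\nu}^{-(\nu+1)/2}))$ and atoms at
each of the boundary points of the interval.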


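PROOF OF PROPOSITION 2.3.

\[
-\log P\{Z(0)\le w_1, Z(t)\le w_2\}
= \frac{\beta\,\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\int_{-\infty}^{\infty}\max\Big\{\frac{(1+\beta^2u^2/\nu)^{-(\nu+1)/2}}{w_1},\frac{(1+\beta^2(u-t)^2/\nu)^{-(\nu+1)/2}}{w_2}\Big\}\,du.
\]

Now $(1+\beta^2u^2/\nu)^{-(\nu+1)/2}/w_1 \ge (1+\beta^2(u-t)^2/\nu)^{-(\nu+1)/2}/w_2$ if and only if
$(1-x)u^2-2tu \ge (x-1)\nu/\beta^2-t^2$. This is equivalent to

\[ \Big(u-\frac{t}{1-x}\Big)^2 \ge \frac{xt^2}{(1-x)^2}-\frac{\nu}{\beta^2} \tag{2.11} \]

if $0 < x < 1$ and to the reversed inequality if $x > 1$. Hence if $0 < x \le b_{1,\nu}$, then
(2.11) holds for all $u$ and we get the last line of (2.9). Similarly if $x \ge b_{2,\nu}$, we get
the first line of (2.9). In the case $b_{1,\nu} < x < 1$ condition (2.11) becomes

\[ \Big|u-\frac{t}{1-x}\Big| \ge \sqrt{\frac{xt^2}{(1-x)^2}-\frac{\nu}{\beta^2}}, \]

that is,

\[ u \le \frac{t}{1-x}-\sqrt{\frac{xt^2}{(1-x)^2}-\frac{\nu}{\beta^2}} \qquad\text{or}\qquad u \ge \frac{t}{1-x}+\sqrt{\frac{xt^2}{(1-x)^2}-\frac{\nu}{\beta^2}}. \]

Hence the integral over this region becomes

\[
\frac{\beta\,\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\int_{-\infty}^{t/(1-x)-\sqrt{xt^2/(1-x)^2-\nu/\beta^2}}\frac{(1+\beta^2u^2/\nu)^{-(\nu+1)/2}}{w_1}\,du
+\frac{\beta\,\Gamma((\nu+1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\int_{t/(1-x)+\sqrt{xt^2/(1-x)^2-\nu/\beta^2}}^{\infty}\frac{(1+\beta^2u^2/\nu)^{-(\nu+1)/2}}{w_1}\,du
\]
\[
= \frac{1}{w_1}\Big(1-P\Big\{\Big|T_\nu-\frac{\beta t}{1-x}\Big| \le \sqrt{\frac{\beta^2t^2x}{(1-x)^2}-\nu}\Big\}\Big).
\]

This, in combination with the integral stemming from the case
$(1+\beta^2u^2/\nu)^{-(\nu+1)/2}/w_1 < (1+\beta^2(u-t)^2/\nu)^{-(\nu+1)/2}/w_2$, gives the fourth line of (2.9). Similarly in the case $1 < x <
b_{2,\nu}$ we get the second line of (2.9). Note that in the case $x = 1$,
$(1+\beta^2u^2/\nu)^{-(\nu+1)/2}/w \ge (1+\beta^2(u-t)^2/\nu)^{-(\nu+1)/2}/w$
if and only if $u \le t/2$ for $t > 0$ and $u \ge t/2$ for $t < 0$. Hence
we get for (2.9) $\frac{2}{w}P\{T_\nu \le \beta|t|/2\}$. □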
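
The case distinction of Proposition 2.3 can be checked against a direct numerical evaluation of the defining integral. The sketch below is hedged accordingly: the function names are ours, the Student probabilities are computed by a midpoint rule rather than by a library routine, and the real line is truncated to a wide window (an assumption that costs a small tail error because of the polynomial tails of the $t$-density).

```python
import math


def t_density(u: float, nu: int) -> float:
    """Student t density with nu degrees of freedom and scale 1."""
    c = math.gamma((nu + 1) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return c * (1.0 + u * u / nu) ** (-(nu + 1) / 2.0)


def midpoint(f, lo: float, hi: float, n: int) -> float:
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))


def t_prob_interval(center: float, radius: float, nu: int) -> float:
    """P{|T_nu - center| <= radius}, computed numerically."""
    return midpoint(lambda u: t_density(u, nu), center - radius, center + radius, 20_000)


def b_bounds(t: float, beta: float, nu: int):
    """The constants b_{1,nu} and b_{2,nu}."""
    q = (beta * t) ** 2
    root = math.sqrt(q / nu) * math.sqrt(1.0 + q / (4.0 * nu))
    return 1.0 + q / (2.0 * nu) - root, 1.0 + q / (2.0 * nu) + root


def v_closed(w1: float, w2: float, t: float, beta: float, nu: int) -> float:
    """The five cases of (2.9)."""
    b1, b2 = b_bounds(t, beta, nu)
    x = (w1 / w2) ** (2.0 / (nu + 1))
    if x >= b2:
        return 1.0 / w2
    if x <= b1:
        return 1.0 / w1
    if x == 1.0:
        half = midpoint(lambda u: t_density(u, nu), 0.0, beta * abs(t) / 2.0, 20_000)
        return 2.0 / w1 * (0.5 + half)
    radius = math.sqrt((beta * t) ** 2 * x / (1.0 - x) ** 2 - nu)
    p1 = t_prob_interval(beta * t / (1.0 - x), radius, nu)
    p2 = t_prob_interval(beta * t * x / (1.0 - x), radius, nu)
    if x > 1.0:
        return p1 / w1 + (1.0 - p2) / w2
    return (1.0 - p1) / w1 + p2 / w2


def v_integral(w1: float, w2: float, t: float, beta: float, nu: int,
               n: int = 400_000) -> float:
    """Direct midpoint evaluation of the defining integral with density (1.4)."""
    def f(u):
        return max(t_density(beta * u, nu) / w1, t_density(beta * (u - t), nu) / w2)
    lo, hi = min(0.0, t) - 200.0 / beta, max(0.0, t) + 200.0 / beta
    return beta * midpoint(f, lo, hi, n)


if __name__ == "__main__":
    for args in [(1.0, 2.0, 1.0, 1.0, 3), (2.0, 1.0, 1.0, 1.0, 3), (1.0, 1.0, 1.0, 1.0, 3)]:
        print(args, v_closed(*args), v_integral(*args))
```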
Next we move to the two-dimensional models. 


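PROPOSITION 2.4. For $t = (t_1, t_2) \in \mathbb{R}^2$ and the exponential model, for
$w_1, w_2 > 0$,

\[
-\log P\{Z(0,0)\le w_1, Z(t_1,t_2)\le w_2\}
= \begin{cases}
\dfrac{1}{w_1}, & (w_1,w_2)\in A_1,\\[1ex]
\dfrac{1}{w_1}+\dfrac{1}{w_2}-\dfrac{e^{-\beta(|t_1|+|t_2|)/2}}{\sqrt{w_1w_2}}\Big[1+\dfrac{\beta(|t_1|+|t_2|)}{4}+\dfrac14\ln\Big(\dfrac{w_1}{w_2}\Big)\Big], & (w_1,w_2)\in A_2,\\[1ex]
\dfrac{1}{w_1}+\dfrac{1}{w_2}-\dfrac{e^{-\beta(|t_1|+|t_2|)/2}}{\sqrt{w_1w_2}}\Big[1+\dfrac{\beta}{2}(|t_1|\wedge|t_2|)\Big], & (w_1,w_2)\in A_3,\\[1ex]
\dfrac{1}{w_1}+\dfrac{1}{w_2}-\dfrac{e^{-\beta(|t_1|+|t_2|)/2}}{\sqrt{w_1w_2}}\Big[1+\dfrac{\beta(|t_1|+|t_2|)}{4}+\dfrac14\ln\Big(\dfrac{w_2}{w_1}\Big)\Big], & (w_1,w_2)\in A_4,\\[1ex]
\dfrac{1}{w_2}, & (w_1,w_2)\in A_5,
\end{cases}
\tag{2.12}
\]

where

\[
\begin{aligned}
A_1 &= \Big\{(w_1,w_2): \frac1\beta\ln\Big(\frac{w_1}{w_2}\Big) < -|t_1|-|t_2|\Big\},\\
A_2 &= \Big\{(w_1,w_2): -(|t_1|+|t_2|) \le \frac1\beta\ln\Big(\frac{w_1}{w_2}\Big) < |t_1|\wedge|t_2|-|t_1|\vee|t_2|\Big\},\\
A_3 &= \Big\{(w_1,w_2): |t_1|\wedge|t_2|-|t_1|\vee|t_2| \le \frac1\beta\ln\Big(\frac{w_1}{w_2}\Big) \le |t_1|\vee|t_2|-|t_1|\wedge|t_2|\Big\},\\
A_4 &= \Big\{(w_1,w_2): |t_1|\vee|t_2|-|t_1|\wedge|t_2| < \frac1\beta\ln\Big(\frac{w_1}{w_2}\Big) \le |t_1|+|t_2|\Big\},\\
A_5 &= \Big\{(w_1,w_2): \frac1\beta\ln\Big(\frac{w_1}{w_2}\Big) > |t_1|+|t_2|\Big\}.
\end{aligned}
\]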


PROOF. We work out the integral as in the one-dimensional case on the areas
of $u$ defined by $|u_1-t_1|+|u_2-t_2|-|u_1|-|u_2| \le \frac1\beta\ln(\frac{w_1}{w_2})$. It is convenient,
similarly to the one-dimensional case, to consider separately the nine areas defined
by the position of $u_1$ with respect to 0 and $t_1$ and by the position of $u_2$ with respect
to 0 and $t_2$. So, for example, if $0 < t_1 < t_2$, we can write

\[
|u_1-t_1|+|u_2-t_2|-|u_1|-|u_2|
= \begin{cases}
-t_1-t_2, & \text{if } u_1 > t_1 \text{ and } u_2 > t_2,\\
-2u_2+t_2-t_1, & \text{if } u_1 > t_1 \text{ and } 0 < u_2 < t_2,\\
t_2-t_1, & \text{if } u_1 > t_1 \text{ and } u_2 < 0,\\
-2u_1+t_1-t_2, & \text{if } 0 < u_1 < t_1 \text{ and } u_2 > t_2,\\
-2u_1-2u_2+t_1+t_2, & \text{if } 0 < u_1 < t_1 \text{ and } 0 < u_2 < t_2,\\
-2u_1+t_1+t_2, & \text{if } 0 < u_1 < t_1 \text{ and } u_2 < 0,\\
t_1-t_2, & \text{if } u_1 < 0 \text{ and } u_2 > t_2,\\
-2u_2+t_1+t_2, & \text{if } u_1 < 0 \text{ and } 0 < u_2 < t_2,\\
t_1+t_2, & \text{if } u_1 < 0 \text{ and } u_2 < 0.
\end{cases}
\]

The calculations are complicated but not difficult. □
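
Since the calculation is only outlined, a brute-force comparison is a useful safeguard. The sketch below evaluates the defining double integral with the density (1.6) by a two-dimensional midpoint rule and compares it with the five cases of (2.12). The names, the integration window and the grid size are illustrative assumptions; the midpoint rule is only accurate to about three decimals here because of the cusps of the double exponential density.

```python
import math


def v_closed_2d(w1: float, w2: float, t1: float, t2: float, beta: float) -> float:
    """The five cases of (2.12) for the two-dimensional exponential model."""
    a, b = abs(t1), abs(t2)
    c = math.log(w1 / w2) / beta
    if c < -(a + b):
        return 1.0 / w1                      # region A_1
    if c > a + b:
        return 1.0 / w2                      # region A_5
    common = math.exp(-beta * (a + b) / 2.0) / math.sqrt(w1 * w2)
    gap = max(a, b) - min(a, b)
    if c < -gap:                             # region A_2
        bracket = 1.0 + beta * (a + b) / 4.0 + math.log(w1 / w2) / 4.0
    elif c <= gap:                           # region A_3
        bracket = 1.0 + beta * min(a, b) / 2.0
    else:                                    # region A_4
        bracket = 1.0 + beta * (a + b) / 4.0 + math.log(w2 / w1) / 4.0
    return 1.0 / w1 + 1.0 / w2 - common * bracket


def v_integral_2d(w1: float, w2: float, t1: float, t2: float, beta: float,
                  n: int = 600) -> float:
    """Two-dimensional midpoint rule for the integral of max{phi(u)/w1, phi(u-t)/w2}."""
    def axis(t):
        lo = min(0.0, t) - 15.0 / beta
        hi = max(0.0, t) + 15.0 / beta
        h = (hi - lo) / n
        pts = [lo + (i + 0.5) * h for i in range(n)]
        at_origin = [math.exp(-beta * abs(u)) for u in pts]
        at_site = [math.exp(-beta * abs(u - t)) for u in pts]
        return h, at_origin, at_site

    h1, e1, f1 = axis(t1)
    h2, e2, f2 = axis(t2)
    total = 0.0
    for p, q in zip(e1, f1):
        for r, s in zip(e2, f2):
            total += max(p * r / w1, q * s / w2)
    return 0.25 * beta * beta * h1 * h2 * total


if __name__ == "__main__":
    beta, t1, t2 = 1.0, 0.5, 1.0
    for w1, w2 in [(1.0, 1.0), (math.e, 1.0), (1.0, math.e), (math.e ** 2, 1.0)]:
        print(w1, w2, v_closed_2d(w1, w2, t1, t2, beta), v_integral_2d(w1, w2, t1, t2, beta))
```

At $w_1 = w_2 = 1$ the closed form gives $2-(1+\frac{\beta}{2}(|t_1|\wedge|t_2|))e^{-\beta(|t_1|+|t_2|)/2}$, in agreement with the function that defines $Q_{a,b}$ in Section 3.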
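PROPOSITION 2.5. For $t = (t_1, t_2) \in \mathbb{R}^2$ and the normal model, we have for
$w_1, w_2 > 0$,

\[
-\log P\{Z(0,0)\le w_1, Z(t_1,t_2)\le w_2\}
= \frac{1}{w_1}\Phi\Big(\frac{\beta\|t\|}{2}+\frac{1}{\beta\|t\|}\log\frac{w_2}{w_1}\Big)+\frac{1}{w_2}\Phi\Big(\frac{\beta\|t\|}{2}+\frac{1}{\beta\|t\|}\log\frac{w_1}{w_2}\Big),
\tag{2.13}
\]

that is, it is the same as in the one-dimensional case with $|t|$ replaced by $\|t\|$.

PROOF. As in the one-dimensional case. Now
$e^{-\beta^2\|u\|^2/2}/w_1 \ge e^{-\beta^2\|u-t\|^2/2}/w_2$ if and
only if $\beta\,\dfrac{u_1t_1+u_2t_2}{\|t\|} \le \dfrac{\beta\|t\|}{2}+\dfrac{1}{\beta\|t\|}\log\dfrac{w_2}{w_1}$. Next note that if the vector $(U_1, U_2)$ has a
standard two-dimensional normal distribution, $(t_1U_1+t_2U_2)/\|t\|$ is also standard normal.
The rest of the proof is as in the one-dimensional case. □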
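PROPOSITION 2.6. For $t = (t_1, t_2) \in \mathbb{R}^2$ and the $t$-model, for $w_1, w_2 > 0$,

\[
-\log P\{Z(0,0)\le w_1, Z(t_1,t_2)\le w_2\}
= \begin{cases}
\dfrac{1}{w_2}, & 0 < w_2 \le b_{2,\alpha}^{-\alpha}w_1,\\[1ex]
\dfrac{1}{w_1}P\{(T_1,T_2)\in A_{1,\alpha}\}+\dfrac{1}{w_2}P\{(T_1,T_2)\notin A_{2,\alpha}\}, & b_{2,\alpha}^{-\alpha}w_1 < w_2 < w_1,\\[1ex]
\dfrac{2}{w}P\{T_1 \le |t|/2\}, & w_1 = w_2 =: w,\\[1ex]
\dfrac{1}{w_1}P\{(T_1,T_2)\notin A_{1,\alpha}\}+\dfrac{1}{w_2}P\{(T_1,T_2)\in A_{2,\alpha}\}, & w_1 < w_2 < b_{1,\alpha}^{-\alpha}w_1,\\[1ex]
\dfrac{1}{w_1}, & w_2 \ge b_{1,\alpha}^{-\alpha}w_1,
\end{cases}
\tag{2.14}
\]

where $(T_1, T_2)$ is a random vector with bivariate $t$-density (1.7),

\[
b_{1,\alpha} = 1+\frac{\beta^2|t|^2}{4(\alpha-1)}-\frac{\beta|t|}{\sqrt{2(\alpha-1)}}\sqrt{1+\frac{\beta^2|t|^2}{8(\alpha-1)}}, \qquad
b_{2,\alpha} = 1+\frac{\beta^2|t|^2}{4(\alpha-1)}+\frac{\beta|t|}{\sqrt{2(\alpha-1)}}\sqrt{1+\frac{\beta^2|t|^2}{8(\alpha-1)}},
\]

\[
A_{1,\alpha} = \Big\{(u_1,u_2)\in\mathbb{R}^2: \Big(u_1-\frac{t_1}{1-x}\Big)^2+\Big(u_2-\frac{t_2}{1-x}\Big)^2 \le \frac{x|t|^2}{(1-x)^2}-\frac{2(\alpha-1)}{\beta^2}\Big\},
\]

\[
A_{2,\alpha} = \Big\{(u_1,u_2)\in\mathbb{R}^2: \Big(u_1-\frac{t_1x}{1-x}\Big)^2+\Big(u_2-\frac{t_2x}{1-x}\Big)^2 \le \frac{x|t|^2}{(1-x)^2}-\frac{2(\alpha-1)}{\beta^2}\Big\}
\]

and

\[ x = \Big(\frac{w_1}{w_2}\Big)^{1/\alpha}. \]

PROOF. Analogous to the one-dimensional case. □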


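PROPOSITION 2.7. For $t = (t_1, t_2) \in \mathbb{R}^2$ and the general normal model, for
$w_1, w_2 > 0$,

\[
-\log P\{Z(0,0)\le w_1, Z(t_1,t_2)\le w_2\}
= \frac{1}{w_1}\Phi\Big(\frac{\sqrt{t'\Sigma^{-1}t}}{2}+\frac{1}{\sqrt{t'\Sigma^{-1}t}}\log\frac{w_2}{w_1}\Big)
+\frac{1}{w_2}\Phi\Big(\frac{\sqrt{t'\Sigma^{-1}t}}{2}+\frac{1}{\sqrt{t'\Sigma^{-1}t}}\log\frac{w_1}{w_2}\Big),
\tag{2.15}
\]

where

\[
\Sigma^{-1} = \frac{1}{1-\rho^2}\begin{pmatrix}\beta_1^2 & -\rho\beta_1\beta_2\\ -\rho\beta_1\beta_2 & \beta_2^2\end{pmatrix}.
\]

PROOF. As in the one-dimensional case. Now
$e^{-u'\Sigma^{-1}u/2}/w_1 \ge e^{-(u-t)'\Sigma^{-1}(u-t)/2}/w_2$ if
and only if $u'\Sigma^{-1}t \le \frac12t'\Sigma^{-1}t+\log\frac{w_2}{w_1}$. Next note that if the vector $U = (U_1, U_2)$
has a bivariate normal distribution with mean value zero and covariance matrix $\Sigma$,
$U'\Sigma^{-1}t$ has a normal distribution with mean value zero and variance $t'\Sigma^{-1}t$. The
rest of the proof is as in the one-dimensional case. □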
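
Formula (2.15) depends on the sites only through the quadratic form $t'\Sigma^{-1}t$. The following sketch, with illustrative function names and grid sizes of our choosing, evaluates (2.15) and compares it with a direct two-dimensional midpoint integration of $\max\{\phi(u)/w_1, \phi(u-t)/w_2\}$ for the density (1.8).

```python
import math


def Phi(x: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def quad_form(u1, u2, beta1, beta2, rho):
    """u' Sigma^{-1} u for the matrix Sigma^{-1} of Proposition 2.7."""
    return (beta1 ** 2 * u1 ** 2 - 2.0 * rho * beta1 * beta2 * u1 * u2
            + beta2 ** 2 * u2 ** 2) / (1.0 - rho ** 2)


def v_general_normal(w1, w2, t1, t2, beta1, beta2, rho):
    """Closed form (2.15); requires (t1, t2) != (0, 0)."""
    m = math.sqrt(quad_form(t1, t2, beta1, beta2, rho))
    return (Phi(m / 2.0 + math.log(w2 / w1) / m) / w1
            + Phi(m / 2.0 + math.log(w1 / w2) / m) / w2)


def v_integral_2d(w1, w2, t1, t2, beta1, beta2, rho, n=300):
    """Midpoint rule on a window of +-8 marginal standard deviations (1/beta_i)."""
    lo1, hi1 = min(0.0, t1) - 8.0 / beta1, max(0.0, t1) + 8.0 / beta1
    lo2, hi2 = min(0.0, t2) - 8.0 / beta2, max(0.0, t2) + 8.0 / beta2
    h1, h2 = (hi1 - lo1) / n, (hi2 - lo2) / n
    const = beta1 * beta2 / (2.0 * math.pi * math.sqrt(1.0 - rho ** 2))
    total = 0.0
    for i in range(n):
        u1 = lo1 + (i + 0.5) * h1
        for j in range(n):
            u2 = lo2 + (j + 0.5) * h2
            first = math.exp(-0.5 * quad_form(u1, u2, beta1, beta2, rho)) / w1
            second = math.exp(-0.5 * quad_form(u1 - t1, u2 - t2, beta1, beta2, rho)) / w2
            total += max(first, second)
    return const * h1 * h2 * total


if __name__ == "__main__":
    args = (1.0, 2.0, 0.6, -0.4, 1.2, 0.8, 0.5)
    print(v_general_normal(*args), v_integral_2d(*args))
```

With $\rho = 0$ and $\beta_1 = \beta_2 = \beta$ the quadratic form reduces to $\beta^2(t_1^2+t_2^2)$, so (2.15) coincides with (2.13).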


3. Estimating the dependence parameter $\beta$. We consider a sequence of in-
dependent, identically distributed stochastic processes with continuous paths

\[ \{X_i(t)\}_{t\in\mathbb{R}}, \qquad i = 1, 2, \ldots. \]

We assume that the processes are in the max-domain of attraction [as processes in
$C(\mathbb{R})$] of a max-stable stationary process $\{\tilde Z(t)\}_{t\in\mathbb{R}}$ such that the related process $Z$
(see the Introduction) has exponential spectral function (2.2) or (2.12), or normal
spectral function (2.6) or (2.13), or $t$ spectral function (2.9) or (2.14) as discussed
in Section 2. For definition of convergence and convergence criteria see [4].

Ideally we would observe the full sample paths of $n$ processes $X$ as a basis
for estimation of the main parameter $\beta$. However,
in reality this is too much to expect. Usually one can observe the $n$ processes only
at finitely many points in space, say $t_1, t_2, \ldots, t_d$.

In this setup we propose estimators for $\beta$ that are closely related to an extension
of the estimator $\hat R(x, y)$ for the dependence function $R(x, y)$ which was intro-
duced by Mason and Huang (see Huang [9]),

\[
\hat R_{t_1,t_2,\ldots,t_d}(x_1,x_2,\ldots,x_d) := \frac1k\sum_{i=1}^{n}1_{\{X_i(t_1)\ge X_{n-[kx_1]+1,n}(t_1),\,\ldots,\,X_i(t_d)\ge X_{n-[kx_d]+1,n}(t_d)\}}
\]

with $k < n$, where $\{X_{i,n}(t_j)\}_{i=1}^{n}$ are the $n$th-order statistics of $\{X_i(t_j)\}_{i=1}^{n}$, for $j =
1, 2, \ldots, d$.
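
The estimator $\hat R$ only involves ranks within each site, which makes it short to code. The sketch below is one possible reading of the definition, with order statistics taken in increasing order; the function name and the list-of-columns data layout are assumptions made for illustration.

```python
def tail_dependence_estimate(columns, k, x=None):
    """R-hat_{t_1,...,t_d}(x_1,...,x_d) from n joint observations at d sites.

    columns[j][i] holds X_i(t_j); all columns must have the same length n.
    An observation is counted when, at every site j, it is at least the
    (n - [k x_j] + 1)st order statistic (order statistics in increasing order).
    """
    d = len(columns)
    n = len(columns[0])
    if x is None:
        x = [1.0] * d
    thresholds = []
    for col, xj in zip(columns, x):
        m = int(k * xj)                      # [k x_j], the integer part
        if not 1 <= m <= n:
            raise ValueError("need 1 <= [k x_j] <= n")
        ordered = sorted(col)
        thresholds.append(ordered[n - m])    # 1-based index n - m + 1
    count = sum(
        all(col[i] >= th for col, th in zip(columns, thresholds))
        for i in range(n)
    )
    return count / k


if __name__ == "__main__":
    same = [float(i) for i in range(100)]
    reverse = same[::-1]
    # Identical columns are perfectly tail dependent; reversed ranks are not.
    print(tail_dependence_estimate([same, same], k=10))
    print(tail_dependence_estimate([same, reverse], k=10))
```

With distinct values, two identical columns give $\hat R(1,1) = 1$ and two columns with reversed ranks give $\hat R(1,1) = 0$ whenever $n \ge 2k$.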

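Let us now introduce our estimators in $\mathbb{R}$. All three densities have the form
$\beta\phi_0(\beta t)$. Corollary 3.2 below states that

\[
R_{t_1,t_2,\ldots,t_d}(1,1,\ldots,1) = 2\int_{(\beta/2)(\max_{1\le j\le d}t_j-\min_{1\le j\le d}t_j)}^{+\infty}\phi_0(t)\,dt
= 2\Big[1-F\Big(\frac\beta2\Big(\max_{1\le j\le d}t_j-\min_{1\le j\le d}t_j\Big)\Big)\Big]
\]

with $F(t) := \int_{-\infty}^{t}\phi_0(s)\,ds$. It follows that

\[ \beta = \frac{2F^{\leftarrow}\big(1-R_{t_1,t_2,\ldots,t_d}(1,1,\ldots,1)/2\big)}{\max_{1\le j\le d}t_j-\min_{1\le j\le d}t_j}. \]

Hence we introduce the estimators

\[ \hat\beta^{(1)} := \frac{2F^{\leftarrow}\big(1-\hat R_{t_1,t_2,\ldots,t_d}(1,1,\ldots,1)/2\big)}{\max_{1\le j\le d}t_j-\min_{1\le j\le d}t_j} \]

and

\[ \hat\beta^{(2)} := \frac{2}{d(d-1)}\sum_{1\le j<m\le d}\frac{2}{|t_j-t_m|}F^{\leftarrow}\big(1-\hat R_{t_j,t_m}(1,1)/2\big). \]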
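
To make the inversion concrete, the sketch below maps values of $R_{t_j,t_m}(1,1)$ back to $\beta$ for the normal and the double exponential models and averages over all pairs of sites. It is a minimal illustration: the function names and the dictionary interface for the pairwise values are assumptions, and the quantile functions are those of $\phi_0$ with $\beta = 1$ (standard normal, and the density $e^{-|t|}/2$ with $F(x) = 1-e^{-x}/2$ for $x \ge 0$).

```python
import math
from itertools import combinations
from statistics import NormalDist


def quantile_normal(p: float) -> float:
    """F^{<-}(p) for the standard normal density; needs 0 < p < 1."""
    return NormalDist().inv_cdf(p)


def quantile_double_exponential(p: float) -> float:
    """F^{<-}(p) for the density exp(-|x|)/2, valid for 1/2 <= p < 1."""
    return -math.log(2.0 * (1.0 - p))


def beta_from_r(r: float, spread: float, quantile) -> float:
    """beta = 2 F^{<-}(1 - r/2) / spread, for 0 < r <= 1 and spread > 0."""
    return 2.0 * quantile(1.0 - r / 2.0) / spread


def beta_hat_pairwise(sites, r_values, quantile) -> float:
    """Average of the pairwise values; r_values[(j, m)] plays the role of R-hat(1,1)."""
    pairs = list(combinations(range(len(sites)), 2))
    total = sum(
        beta_from_r(r_values[(j, m)], abs(sites[j] - sites[m]), quantile)
        for j, m in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    beta, sites = 1.3, [0.0, 0.5, 2.0]
    # Population values of R(1,1) under the exponential model: exp(-beta |t_j - t_m| / 2).
    r_exp = {(j, m): math.exp(-beta * abs(sites[j] - sites[m]) / 2.0)
             for j, m in combinations(range(len(sites)), 2)}
    print(beta_hat_pairwise(sites, r_exp, quantile_double_exponential))
```

Feeding in the population values of $R(1,1)$ returns $\beta$ itself; with $\hat R$ in their place the same computation gives the pairwise estimator.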


Considering the two-dimensional space, note that the standard normal and Student
densities are spherically symmetric. This means (cf. proof of Proposition 2.5) that
it is sufficient to consider the marginal distribution and hence we introduce the
estimators

\[ \hat\beta^{(2)} := \frac{2}{d(d-1)}\sum_{1\le j<m\le d}\frac{2}{|t_j-t_m|}F^{\leftarrow}\big(1-\hat R_{t_j,t_m}(1,1)/2\big), \]

with $F$ the (common) marginal distribution corresponding to $\phi_0$ (standard normal
or Student). The exponential model in two-dimensional space is more complicated.
We consider

\[ \hat\beta^{(e,2)} := \frac{2}{d(d-1)}\sum_{1\le j<m\le d}Q_{a_{jm},b_{jm}}\big(\hat R_{t_j,t_m}(1,1)\big) \]

with $a_{jm} := |t_j^{(1)}-t_m^{(1)}|$, the absolute difference of the first components of
$t_j$ and $t_m$, and $b_{jm} := |t_j^{(2)}-t_m^{(2)}|$ and $Q_{a,b}$ the inverse function of
$(1+\frac{\beta}{2}\min(a,b))\exp(-\frac{\beta}{2}(a+b))$, which is decreasing in $\beta$ for $a, b > 0$.
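
The function $Q_{a,b}$ has no closed form, but since the map from $\beta$ to $R(1,1)$ is decreasing it can be inverted by bisection. The sketch below is one way to do this; the function names, the bracketing strategy and the fixed number of bisection steps are our own choices.

```python
import math


def r_exponential_2d(beta: float, a: float, b: float) -> float:
    """(1 + (beta/2) min(a, b)) exp(-(beta/2)(a + b)), decreasing in beta."""
    return (1.0 + 0.5 * beta * min(a, b)) * math.exp(-0.5 * beta * (a + b))


def q_inverse(r: float, a: float, b: float) -> float:
    """Q_{a,b}(r): the beta >= 0 with r_exponential_2d(beta, a, b) = r."""
    if not 0.0 < r < 1.0:
        raise ValueError("r must lie strictly between 0 and 1")
    if a < 0.0 or b < 0.0 or a + b <= 0.0:
        raise ValueError("need a, b >= 0 with a + b > 0")
    lo, hi = 0.0, 1.0
    while r_exponential_2d(hi, a, b) > r:    # enlarge the bracket
        hi *= 2.0
    for _ in range(200):                      # plain bisection
        mid = 0.5 * (lo + hi)
        if r_exponential_2d(mid, a, b) > r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    beta, a, b = 1.7, 0.5, 1.2
    r = r_exponential_2d(beta, a, b)
    print(r, q_inverse(r, a, b))
```

Averaging `q_inverse` over all pairs of sites, with $\hat R_{t_j,t_m}(1,1)$ in place of `r`, gives the estimator for the two-dimensional exponential model.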
Note that $\hat\beta^{(1)}$ is simpler than $\hat\beta^{(2)}$ and summarizes the information of the sam-
ple in a somewhat more crude way. We could not find analogues of $\hat\beta^{(1)}$ in
two-dimensional space since we were unable to calculate explicitly the necessary
higher-dimensional distributions.

All the mentioned estimators are consistent and asymptotically normal under
the appropriate conditions. We now state the results. The proofs will be given at
the end of the section after some lemmas. First we consider consistency.

THEOREM 3.1. Suppose that the normalized sequence of maxima [see (1.1)]
converges weakly to $\{\tilde Z(t)\}$ in $C(-\infty, \infty)$. For sequences $k = k(n) \to \infty$,
$k(n)/n \to 0$ as $n \to \infty$ (recall that the sequence $k$ figures in the definition of $\hat R$),
all the indicated estimators are weakly consistent for $\beta$.


Also the estimators are asymptotically normal under some extra conditions, that
is, $\sqrt{k}(\hat\beta-\beta)$ has asymptotically a normal mean zero distribution. In order to de-
scribe the asymptotic distribution more accurately, we now state a slight extension
of a result of Huang and Mason.


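PROPOSITION 3.2 ([9], pages 29 and 43). Let $\{(X_i(t_1), \ldots, X_i(t_d))\}_{i=1}^{\infty}$ be
i.i.d. random vectors with distribution function $F$. Suppose that the marginal dis-
tributions $F_i$ ($i = 1, 2, \ldots, d$) are continuous and strictly increasing. Define

\[
P_{t_1,t_2,\ldots,t_d}(x_1,\ldots,x_d) := 1-F\big(F_1^{\leftarrow}(1-x_1),\ldots,F_d^{\leftarrow}(1-x_d)\big)
= P\{1-F_1(X(t_1))\le x_1 \text{ or } \cdots \text{ or } 1-F_d(X(t_d))\le x_d\},
\]

where the arrow denotes inverse function. Suppose that for all $x_1, x_2, \ldots, x_d \ge 0$,
$x_1+x_2+\cdots+x_d > 0$ and a positive function $L$,

\[ \lim_{t\downarrow0}t^{-1}P_{t_1,t_2,\ldots,t_d}(tx_1,tx_2,\ldots,tx_d) = L_{t_1,t_2,\ldots,t_d}(x_1,x_2,\ldots,x_d). \tag{3.1} \]

Next we introduce the definition of a function $R$ which is connected with the func-
tion $L$ as follows. Let $\nu_{t_1,t_2,\ldots,t_d}$ be the measure that satisfies for $x_1, x_2, \ldots, x_d > 0$

\[ \nu_{t_1,t_2,\ldots,t_d}\{(s_1,\ldots,s_d): s_1\le x_1 \text{ or } \cdots \text{ or } s_d\le x_d\} = L_{t_1,t_2,\ldots,t_d}(x_1,\ldots,x_d). \]

Such a measure exists by virtue of (3.1). The function $R$ is given by

\[ R_{t_1,t_2,\ldots,t_d}(x_1,x_2,\ldots,x_d) = \nu_{t_1,t_2,\ldots,t_d}\big([0,x_1]\times[0,x_2]\times\cdots\times[0,x_d]\big). \]

Define

\[
\hat R_{t_1,t_2,\ldots,t_d}(x_1,\ldots,x_d) := \frac1k\sum_{i=1}^{n}1_{\{X_i(t_1)\ge X_{n-[kx_1]+1,n}(t_1),\,\ldots,\,X_i(t_d)\ge X_{n-[kx_d]+1,n}(t_d)\}},
\]

where $X_{n-[kx_j]+1,n}(t_j)$ is the $(n-[kx_j]+1)$st-order statistic of $X_1(t_j), X_2(t_j), \ldots,
X_n(t_j)$, $j = 1, 2, \ldots, d$. Then for all $x_1, x_2, \ldots, x_d \ge 0$, $x_1+x_2+\cdots+x_d > 0$,

\[ \hat R_{t_1,t_2,\ldots,t_d}(x_1,x_2,\ldots,x_d) \to R_{t_1,t_2,\ldots,t_d}(x_1,x_2,\ldots,x_d) \quad\text{in probability,} \tag{3.2} \]

$n \to \infty$, $k = k(n) \to \infty$, $k/n \to 0$.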




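Suppose now that the function $L$ has continuous first-order partial derivatives $L^{(j)}_{t_1,\dots,t_d}$, $j = 1, 2, \dots, d$. Moreover suppose that for some $\alpha > 0$

$$F_{t_1,\dots,t_d}(tx_1, tx_2, \dots, tx_d) = t\bigl\{L_{t_1,\dots,t_d}(x_1, x_2, \dots, x_d) + O(t^{\alpha})\bigr\}, \qquad t \downarrow 0, \tag{3.3}$$

uniformly on $x_1^2 + x_2^2 + \cdots + x_d^2 = 1$, $x_i \ge 0$, $i = 1, 2, \dots, d$. Then for a sequence $k = k(n) \to \infty$ with $k = o(n^{2\alpha/(2\alpha+1)})$, $n \to \infty$,

$$\sqrt{k}\,\bigl[\hat R_{t_1,\dots,t_d}(x_1,\dots,x_d) - R_{t_1,\dots,t_d}(x_1,\dots,x_d)\bigr] \xrightarrow{d} B_{t_1,\dots,t_d}(x_1,\dots,x_d) \tag{3.4}$$

in $D(\mathbb{R}_+^d)$, where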
By, to, P tg (Xi, XD, 00 Xd) 
= Wh ip, ota (X xoci d) 
Dedi O nega) Winsett tuU cas) 
IPS UCM p itd (xi X25 sees Xd) Wi tasecosta (0, 0, sey Xd) 
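and $W$ is a continuous mean zero Gaussian process with covariance structure

$$E\,W_{t_1,\dots,t_d}(x_1,\dots,x_d)\,W_{t_1,\dots,t_d}(y_1,\dots,y_d) = \nu_{t_1,\dots,t_d}\bigl(A_{x_1,\dots,x_d} \cap A_{y_1,\dots,y_d}\bigr)$$

with

$$A_{x_1,\dots,x_d} := \{(s_1, s_2, \dots, s_d) : s_1 \le x_1 \text{ or } s_2 \le x_2 \text{ or } \cdots \text{ or } s_d \le x_d\}.$$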

The function $L$ is called the tail dependence function and it is directly related to the extreme-value limit distribution. In fact, if $F$ is in the domain of attraction of an extreme-value distribution $G$, condition (3.1) holds with $L(x_1, \dots, x_d) = -\log G\bigl((-\log G_1)^{\leftarrow}(x_1), \dots, (-\log G_d)^{\leftarrow}(x_d)\bigr)$, where $G_1, \dots, G_d$ are the marginals of $G$.
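
The estimator $\hat R$ of Proposition 3.2 counts the observations that exceed a high empirical quantile at every site simultaneously. The following is a minimal sketch of that count in plain Python, not the authors' implementation; the function name `empirical_tail_R` and the toy data sets are illustrative assumptions, and the inequalities are taken as non-strict.

```python
def empirical_tail_R(samples, k, x):
    """Sketch of R_hat(x_1, ..., x_d): joint exceedances of high order statistics, divided by k.

    samples: list of n observations, each a sequence of d values (one per site).
    k: number of upper order statistics used.
    x: sequence of d positive arguments with 1 <= [k * x_j] <= n.
    """
    n = len(samples)
    d = len(x)
    thresholds = []
    for j in range(d):
        column = sorted(obs[j] for obs in samples)   # ascending order statistics at site j
        m = int(k * x[j])                            # integer part [k x_j]
        if not 1 <= m <= n:
            raise ValueError("need 1 <= [k * x_j] <= n")
        thresholds.append(column[n - m])             # X_{n-[k x_j]+1, n} in 1-based indexing
    joint_exceedances = sum(
        1 for obs in samples
        if all(obs[j] >= thresholds[j] for j in range(d))
    )
    return joint_exceedances / k


# Hypothetical toy data: identical rankings at both sites (complete dependence) ...
comonotone = [(i, 2.0 * i) for i in range(1, 101)]
# ... and reversed rankings (large values never occur together).
countermonotone = [(i, -float(i)) for i in range(1, 101)]

print(empirical_tail_R(comonotone, 10, (1, 1)))
print(empirical_tail_R(countermonotone, 10, (1, 1)))
```

On the comonotone sample the estimate reproduces $\min(x_1, x_2)$, and on the countermonotone sample it vanishes, which are the two extreme shapes the function $R$ can take.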

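REMARK 3.1. For the exponential model in one-dimensional space

$$L_{t_1,\dots,t_d}(x_1,\dots,x_d) = \frac{\beta}{2} \int_{-\infty}^{\infty} \max\bigl(x_1 e^{-\beta|s-t_1|}, \dots, x_d e^{-\beta|s-t_d|}\bigr)\,ds$$

and (see Lemma 3.1 below)

$$R_{t_1,\dots,t_d}(x_1,\dots,x_d) = \frac{\beta}{2} \int_{-\infty}^{\infty} \min\bigl(x_1 e^{-\beta|s-t_1|}, \dots, x_d e^{-\beta|s-t_d|}\bigr)\,ds.$$

Similarly for the normal model and the $t$-model.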




THEOREM 3.3 (One-dimensional space). Suppose for the processes $\{X_n(t)\}$ the limit relation (1.1) holds and moreover the second-order condition (3.3): for some $\alpha > 0$

$$F_{t_1,\dots,t_d}(tx_1, \dots, tx_d) = t\bigl\{-\log P\bigl(Z(t_1) \le 1/x_1, \dots, Z(t_d) \le 1/x_d\bigr) + O(t^{\alpha})\bigr\}, \qquad t \downarrow 0,$$

uniformly on $x_1^2 + x_2^2 + \cdots + x_d^2 = 1$, $x_i \ge 0$, $i = 1, 2, \dots, d$.

If $k = k(n) \to \infty$, $k(n) = o(n^{2\alpha/(2\alpha+1)})$, as $n \to \infty$, then


- . 2 2B, tm (1, 1) ] 
G5 Vh -D-i- 4d-1 2- ii oy l/d 


JE — p 


(3.6) 
ne 2 By, to,...,tg (ls s sey 1) l 
max]«j«gf; — minj«j«qt; qo(B(maxi«,«41; — mini <j<qtj)/2) 


in distribution. Here $B$ is as in Proposition 3.2.


THEOREM 3.4 (Two-dimensional space). Under the same conditions, 


2 2Bt, tal, 1) 1 
nn J/k Dni, Lucho PE RN 
G0 — VR EAB) FGF ity t. mit; tol 
anc 
(3.8) Vk(Be2~- B) > ———— T = D 2a Bus (L 1) Qe bm (Q2, o, BD) 


) m 
in distribution, where $a_{jm} := |t_j^{(1)} - t_m^{(1)}|$ and $b_{jm} := |t_j^{(2)} - t_m^{(2)}|$.


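For the estimation of the general normal model we proceed as follows. Write

$$\Sigma^{-1} = \frac{1}{1-\rho^2} \begin{pmatrix} \beta_1^2 & -\rho\beta_1\beta_2 \\ -\rho\beta_1\beta_2 & \beta_2^2 \end{pmatrix}$$

with $-1 < \rho < 1$ and $\beta_1, \beta_2 > 0$. For two sites $t_j$ and $t_m$ ($1 \le j < m \le d$) we write

$$(t_j - t_m)^T \Sigma^{-1} (t_j - t_m) = \frac{1}{1-\rho^2}\Bigl[\beta_1^2 \bigl(t_j^{(1)} - t_m^{(1)}\bigr)^2 - 2\rho\beta_1\beta_2 \bigl(t_j^{(1)} - t_m^{(1)}\bigr)\bigl(t_j^{(2)} - t_m^{(2)}\bigr) + \beta_2^2 \bigl(t_j^{(2)} - t_m^{(2)}\bigr)^2\Bigr] = t_{j,m}^T a,$$

where $t_j =: (t_j^{(1)}, t_j^{(2)})$ for $j = 1, 2, \dots, d$,

$$t_{j,m} := \Bigl(\bigl(t_j^{(1)} - t_m^{(1)}\bigr)^2,\ \bigl(t_j^{(1)} - t_m^{(1)}\bigr)\bigl(t_j^{(2)} - t_m^{(2)}\bigr),\ \bigl(t_j^{(2)} - t_m^{(2)}\bigr)^2\Bigr)^T$$

and

$$a := (a_1, a_2, a_3)^T := \frac{1}{1-\rho^2}\bigl(\beta_1^2,\ -2\rho\beta_1\beta_2,\ \beta_2^2\bigr)^T. \tag{3.9}$$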

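Now note that

$$2 - R_{t_j,t_m}(1,1) = L_{t_j,t_m}(1,1) = 2\Phi\Bigl(\sqrt{(t_j - t_m)^T \Sigma^{-1} (t_j - t_m)}\,\big/\,2\Bigr).$$

Hence we define estimators

$$\hat q_{j,m} := \Bigl(2\Phi^{\leftarrow}\bigl(1 - \hat R_{t_j,t_m}(1,1)/2\bigr)\Bigr)^2. \tag{3.10}$$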
Using the result of Proposition 3.2 and Cramér's delta method we get 
G11) VK, — t1,2) S 2Be, ty 0, DE a (t7, 2/2) ^, 


where ¢ is the standard normal density. Now compose the (d(d — 1)/2)-dimen- 
sional vectors 


and 


T 
ta 
Then 


(3.12) Jk- Ta) Sb 




with 
Bye (l, D tf ,a(o(/t? 2/2) 
i By ts (1, Du'tT zalo (t7 38/2) 
Buit (l, DIT aal (tL 3/2) 
Next define 
(3.13) à:— (T! T) ! Tq. 
Then 


3.14) Vk(@—a)=(? Tr? /k(à — T3) 5 (TT)! F7. 
By solving the equations (3.9) we get

$$\beta_1 = \sqrt{a_1 - \frac{a_2^2}{4a_3}}, \qquad \beta_2 = \sqrt{a_3 - \frac{a_2^2}{4a_1}}$$

and

$$\rho = -\frac{a_2}{2\sqrt{a_1 a_3}}.$$


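Cramér's delta method now gives the joint asymptotic normality of the estimators

$$\hat\beta_1 := \sqrt{\hat a_1 - \frac{\hat a_2^2}{4\hat a_3}}, \tag{3.15}$$

$$\hat\beta_2 := \sqrt{\hat a_3 - \frac{\hat a_2^2}{4\hat a_1}} \tag{3.16}$$

and

$$\hat\rho := -\frac{\hat a_2}{2\sqrt{\hat a_1 \hat a_3}}. \tag{3.17}$$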
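
The inversion of (3.9) can be checked numerically. The sketch below assumes the parametrization $a = (1-\rho^2)^{-1}(\beta_1^2, -2\rho\beta_1\beta_2, \beta_2^2)$, which is a reading of the damaged display rather than a confirmed formula; the function names and the parameter values are hypothetical.

```python
import math


def coefficients_from_params(beta1, beta2, rho):
    """Map (beta1, beta2, rho) to a = (a1, a2, a3) under the assumed form of (3.9)."""
    scale = 1.0 / (1.0 - rho ** 2)
    return (scale * beta1 ** 2,
            -2.0 * scale * rho * beta1 * beta2,
            scale * beta2 ** 2)


def params_from_coefficients(a1, a2, a3):
    """Invert the map above: recover (beta1, beta2, rho) from a."""
    rho = -a2 / (2.0 * math.sqrt(a1 * a3))
    beta1 = math.sqrt(a1 - a2 ** 2 / (4.0 * a3))
    beta2 = math.sqrt(a3 - a2 ** 2 / (4.0 * a1))
    return beta1, beta2, rho


# Hypothetical parameter values, used only to check the round trip.
true_params = (1.3, 0.7, -0.4)
recovered = params_from_coefficients(*coefficients_from_params(*true_params))
print(recovered)
```

Replacing $a$ by the least squares estimate $\hat a$ of (3.13) in `params_from_coefficients` gives plug-in estimates of the form (3.15)-(3.17).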


For the proofs of the theorems we need a number of auxiliary results. 


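LEMMA 3.1. Suppose for some measure $\nu$ on $\mathbb{R}^d$ and some positive integrable functions $g_1, \dots, g_d$

$$\nu\bigl([x_1, \infty) \cup \cdots \cup [x_d, \infty)\bigr) = \int_{-\infty}^{\infty} \max\bigl(g_1(x - x_1), \dots, g_d(x - x_d)\bigr)\,dx.$$

Then

$$\nu\bigl([x_1, \infty) \cap \cdots \cap [x_d, \infty)\bigr) = \int_{-\infty}^{\infty} \min\bigl(g_1(x - x_1), \dots, g_d(x - x_d)\bigr)\,dx.$$

PROOF. Note that

$$\nu\bigl([x_1, \infty) \cup [x_2, \infty) \cup \cdots \cup [x_d, \infty)\bigr) = \sum_{i=1}^{d} \nu\bigl([x_i, \infty)\bigr) - \sum_{i<j} \nu\bigl([x_i, \infty) \cap [x_j, \infty)\bigr) + \cdots + (-1)^{d-1} \nu\bigl([x_1, \infty) \cap [x_2, \infty) \cap \cdots \cap [x_d, \infty)\bigr),$$

and for any real $a_1, a_2, \dots, a_d$,

$$\max(a_1, a_2, \dots, a_d) = \sum_{i=1}^{d} a_i - \sum_{i<j} a_i \wedge a_j + \cdots + (-1)^{d-1} a_1 \wedge a_2 \wedge \cdots \wedge a_d$$

(both follow by induction). Then the statement also follows by induction. □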


COROLLARY 3.1. Hence for our models in R 


to ó(s—ti) 
Rips) X2 X2) = | min —————- ds. 
—-oo .! Xi 


REMARK 3.2. We apply Lemma 3.1, for example, to the functions g,(x) :— 
e Pl 
e 





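LEMMA 3.2. Suppose $p$ is a probability density on $\mathbb{R}$, $p(x) = p(-x)$ for $x > 0$ and $p(x)$ is decreasing for $x > 0$. Then

$$\int_{-\infty}^{\infty} \min\bigl(p(|s - t_1|), \dots, p(|s - t_d|)\bigr)\,ds = 2 \int_{(1/2)(\max_{1\le j\le d} t_j - \min_{1\le j\le d} t_j)}^{\infty} p(s)\,ds. \tag{3.18}$$

PROOF. Note that

$$\min\bigl(p(|s - t_1|), \dots, p(|s - t_d|)\bigr) = \min\Bigl\{p\bigl(\bigl|s - \min_{1\le j\le d} t_j\bigr|\bigr),\ p\bigl(\bigl|s - \max_{1\le j\le d} t_j\bigr|\bigr)\Bigr\} = \begin{cases} p\bigl(\bigl|s - \min_{1\le j\le d} t_j\bigr|\bigr), & s \ge \min_{1\le j\le d} t_j + \tfrac12\bigl(\max_{1\le j\le d} t_j - \min_{1\le j\le d} t_j\bigr), \\ p\bigl(\bigl|s - \max_{1\le j\le d} t_j\bigr|\bigr), & s < \min_{1\le j\le d} t_j + \tfrac12\bigl(\max_{1\le j\le d} t_j - \min_{1\le j\le d} t_j\bigr). \end{cases}$$

So the integral on the left-hand side equals

$$2 \int_{\min_{1\le j\le d} t_j + (1/2)(\max_{1\le j\le d} t_j - \min_{1\le j\le d} t_j)}^{\infty} p\bigl(s - \min_{1\le j\le d} t_j\bigr)\,ds,$$

which is the right-hand side of (3.18). □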
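
Identity (3.18) is easy to check numerically. The sketch below does so for the Laplace density $p(s) = (\beta/2)e^{-\beta|s|}$, for which the right-hand side of (3.18) equals $\exp(-\beta(\max_j t_j - \min_j t_j)/2)$. The site locations, the value of $\beta$ and the midpoint-rule settings are illustrative assumptions.

```python
import math


def laplace_density(s, beta):
    """Symmetric density, decreasing in |s|: p(s) = (beta / 2) * exp(-beta * |s|)."""
    return 0.5 * beta * math.exp(-beta * abs(s))


def lhs_integral(sites, beta, half_width=40.0, steps=100_000):
    """Midpoint-rule approximation of the integral of min_j p(|s - t_j|) over the real line."""
    lo, hi = min(sites) - half_width, max(sites) + half_width
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        s = lo + (i + 0.5) * h
        total += min(laplace_density(s - t, beta) for t in sites)
    return total * h


def rhs_closed_form(sites, beta):
    """2 * integral of p over [spread / 2, infinity), in closed form for the Laplace density."""
    spread = max(sites) - min(sites)
    return math.exp(-0.5 * beta * spread)


sites = [0.3, 1.1, 2.0, 0.7]   # hypothetical site locations t_1, ..., t_d
beta = 1.5                     # hypothetical dependence parameter
lhs = lhs_integral(sites, beta)
rhs = rhs_closed_form(sites, beta)
print(lhs, rhs)
```

Only the two extreme sites enter the result, which is why the interior sites 0.7 and 1.1 do not affect either side.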
COROLLARY 3.2. Hence for our models in R 
OO 
Ry, tas... 1) =2 $ (s) ds 


(1/2) (max; « «qt , —mtni « «qt ) 
P 
~ jp D F(S (mx t p E22) | 
LEMMA 3.3. Under the conditions of Theorem 3.3, 


E P ; 
AER us ncn t F(a 2,4) 
e B,.. (l, ILLE 1) 


in distribution with B as in Proposition 3.2. 


(3.19) 


PROOF. Combine Proposition 3.2 and Corollary 3.2. □


LEMMA 3.4. Under the conditions of Theorem 3.4, for the standard normal 
and Student models, 


P 
2 
in distribution. For the exponential model 


" 1 : j 
VE Ra, (Ls Les ;(i i É min(|ef? -APL — gp) 


(3.20) VE Ra, C. 1) -2(1 z F( (It, tnl)))| — Bet, (1, 1) 


(3.21) x e SPP APR? | 
— By as. 1) 
in distribution, where $t_j = (t_j^{(1)}, t_j^{(2)})$, $t_m = (t_m^{(1)}, t_m^{(2)})$.


PROOF. Propositions 2.5, 2.6 and 2.4 give $L_{t_j,t_m}(1,1)$, which is $2 - R_{t_j,t_m}(1,1)$. □


PROOF OF THEOREM 3.1. Follows immediately from statement (3.2) of Proposition 3.2 and Lemma 3.2. □


PROOF OF THEOREMS 3.3 AND 3.4. Immediately from Lemmas 3.3, 3.4 and Cramér's delta method. □




Acknowledgments. Comments of two referees made us aware of Smith [14] 
and helped greatly to improve the presentation. 


REFERENCES

[1] BALKEMA, A. A. and DE HAAN, L. (1988). Almost sure continuity of stable moving average processes with index less than one. Ann. Probab. 16 333–343. MR0920275
[2] BROWN, B. M. and RESNICK, S. I. (1977). Extreme values of independent stochastic processes. J. Appl. Probab. 14 732–739. MR0517438
[3] COLES, S. G. (1993). Regional modelling of extreme storms via max-stable processes. J. Roy. Statist. Soc. Ser. B 55 797–816. MR1229882
[4] DE HAAN, L. and LIN, T. (2001). On convergence toward an extreme value distribution in C[0, 1]. Ann. Probab. 29 467–483. MR1825160
[5] DE HAAN, L. and PICKANDS, J., III (1986). Stationary min-stable stochastic processes. Probab. Theory Related Fields 72 477–492. MR0847381
[6] DE HAAN, L. and RESNICK, S. I. (1977). Limit theory for multivariate sample extremes. Z. Wahrsch. Verw. Gebiete 40 317–337. MR0478290
[7] EDDY, W. F. (1980). The distribution of the convex hull of a Gaussian sample. J. Appl. Probab. 17 686–695. MR0580028
[8] FALK, M., HÜSLER, J. and REISS, R.-D. (1994). Laws of Small Numbers: Extremes and Rare Events. Birkhäuser, Basel. MR1296464
[9] HUANG, X. (1992). Statistics of bivariate extreme values. Ph.D. dissertation, Tinbergen Institute.
[10] HÜSLER, J. and REISS, R.-D. (1989). Maxima of normal random vectors: Between independence and complete dependence. Statist. Probab. Lett. 7 283–286. MR0980699
[11] PICKANDS, J., III (1981). Multivariate extreme value distributions (with discussion). Bull. Inst. Internat. Statist. 49 859–878, 894–902. MR0820979
[12] SCHLATHER, M. (2002). Models for stationary max-stable random fields. Extremes 5 33–44. MR1947786
[13] SIBUYA, M. (1960). Bivariate extreme statistics. I. Ann. Inst. Statist. Math. 11 195–210. MR0115241
[14] SMITH, R. (1990). Max-stable processes and spatial extremes. Unpublished manuscript.

ECONOMETRIC INSTITUTE
ERASMUS UNIVERSITY
ROTTERDAM
NETHERLANDS
E-MAIL: ldehaan@few.eur.nl

DEPARTMENT OF STATISTICS AND OPERATIONAL RESEARCH
FACULTY OF SCIENCE
UNIVERSITY OF LISBON
CAMPO GRANDE, BLOCO C6, PISO 4
1749-016 LISBOA
PORTUGAL
E-MAIL: tpereira@fc.ul.pt


The Annals of Statistics
2006, Vol. 34, No. 1, 169–201
DOI: 10.1214/009053605000000895
© Institute of Mathematical Statistics, 2006


PENALIZED MAXIMUM LIKELIHOOD AND SEMIPARAMETRIC 
SECOND-ORDER EFFICIENCY 


By A. S. DALALYAN, G. K. GOLUBEV AND A. B. TSYBAKOV 
Université Paris VI, Université Paris VI and Université Aix-Marseille 1 


We consider the problem of estimation of a shift parameter of an unknown symmetric function in Gaussian white noise. We introduce a notion 
of semiparametric second-order efficiency and propose estimators that are 
semiparametrically efficient and second-order efficient in our model. These 
estimators are of a penalized maximum likelihood type with an appropriately 
chosen penalty. We argue that second-order efficiency is crucial in semipara- 
metric problems since only the second-order terms in asymptotic expansion 
for the risk account for the behavior of the “nonparametric component" of 
a semiparametric procedure, and they are not dramatically smaller than the 
first-order terms. 


1. Introduction. Semiparametric statistical models are the ones containing a finite-dimensional parameter of interest $\theta$ and an infinite-dimensional nuisance parameter $f$ which is a member of some large functional class. The goal is then to estimate $\theta$ efficiently without knowing $f$. A comprehensive account of the theory of semiparametric estimation is given in the book of Bickel, Klaassen, Ritov and Wellner [3]. In particular, it is shown that for many semiparametric models there exist estimators attaining the same asymptotic performance as efficient parametric estimators constructed for the problem where $f$ is completely specified. In other words, for such semiparametric models there is no loss of efficiency as compared to the corresponding parametric models with known $f$. These semiparametric models are usually called adaptive, but we prefer here to call them S-adaptive, or semiparametrically adaptive, in order to avoid confusion with nonparametric adaptivity to unknown smoothness of $f$. Estimators attaining parametric efficiency in S-adaptive models will be called S-adaptive (or efficient) estimators. Here and in what follows efficiency is understood in a local asymptotic minimax sense.

There exist various methods of constructing S-adaptive estimators. A general feature of these methods is that they proceed by "eliminating" the nonparametric component $f$, thus reducing the original semiparametric problem to a suitably chosen parametric one. The most common approach is to specify a least favorable parametric submodel of the full semiparametric model, locally in a neighborhood of $f$, and to estimate $\theta$ in such a submodel ([3, 22, 24, 30-32] and the references 


Received July 2003; revised March 2005. 

AMS 2000 subject classifications. 62G05, 62G20.

Key words and phrases. Semiparametric estimation, estimating a shift of a nonparametric function, second-order efficiency, penalized maximum likelihood, exact minimax asymptotics.




cited therein). Least favorable parametric submodels turn out to depend on f only 
via a score function. "Elimination" of f under this approach means to estimate 
nonparametrically the efficient score function. Resulting estimators of $\theta$ are often 
defined via one-step procedures that involve preliminary estimators of $\theta$ and nonparametric estimators of the efficient score function. We note here, in connection 
with the discussion that follows below, that results on efficiency and S-adaptivity 
are not very sensitive to the choice of preliminary nonparametric estimates of 
the efficient score function. For example, kernel, orthogonal series, nonparamet- 
ric maximum likelihood and other estimates can be used, under rather wide as- 
sumptions on their parameters, such as kernels, bandwidths, etc. The important 
question of how to choose these parameters in practice is left open. Among other 
approaches that allow one to eliminate f efficiently we mention profile likelihood 
techniques [25] and invariance-based inference [13]. 

Thus, for a variety of semiparametric models, the statistician actually has an 
entire library of efficient (S-adaptive) estimators of $\theta$. Which estimator is the best 
one? The theory discussed above does not answer this question because it deals 
only with the first-order asymptotics, which is the same for all S-adaptive estima- 
tors in a given model. Distinguishing between these estimators is possible on the 
basis of higher-order asymptotics. This motivates us to study here second-order 
asymptotics and second-order semiparametric efficiency. We would like to em- 
phasize that a study of second-order effects is more important for semiparametric 
models than for purely parametric ones and it is crucial for practical implementa- 
tion, at least for the following reasons. 


• This is a compelling way to distinguish between various efficient semiparametric methods and to choose the best among them. More specifically, it allows one to choose optimally the smoothing parameters that define the "nonparametric component" of a given family of efficient semiparametric procedures.

• Second-order terms in asymptotics for semiparametric estimators are not dramatically smaller than the first-order terms; they might be in fact quite comparable to each other for moderate sample sizes. Second-order terms depend on the smoothness of $f$. For example, in a typical case of twice differentiable $f$ we get second-order terms $\sim n^{-7/10}$, the first-order asymptotics being as usual $n^{-1/2}$, where $n$ is the sample size. This differs from the purely parametric situation where the second-order terms decrease as $n^{-1}$ (cf. [20]).
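
To see how close the two orders are, one can compare the rates directly. The short sketch below assumes the rates $n^{-1/2}$ (first order), $n^{-7/10}$ (semiparametric second order for twice differentiable $f$) and $n^{-1}$ (parametric second order); the function name and the sample sizes are illustrative.

```python
def order_ratio(n, second_order_exponent, first_order_exponent=0.5):
    """Ratio (second-order rate) / (first-order rate) = n ** -(second - first)."""
    return n ** (-second_order_exponent) / n ** (-first_order_exponent)


for n in (100, 1_000, 10_000):
    semiparametric = order_ratio(n, 0.7)   # n^{-7/10} relative to n^{-1/2}
    parametric = order_ratio(n, 1.0)       # n^{-1} relative to n^{-1/2}
    print(n, round(semiparametric, 4), round(parametric, 4))
```

Under these assumed rates the semiparametric ratio decays only like $n^{-1/5}$, so at $n = 1000$ the second-order term is still about a quarter of the first-order one, whereas the parametric ratio $n^{-1/2}$ is already near 0.03.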


Whereas first-order efficiency considerations for semiparametric models are es- 
sentially of a parametric flavor, the second-order ones come from nonparametric 
function estimation. Therefore, it is not surprising that the importance of second-order semiparametric asymptotics was first realized in the literature on nonparametric smoothing. Härdle and Tsybakov [15] pointed out that, in the single index 
model, the second-order term of the risk of the average derivative estimator is not 
significantly smaller than the first-order one and suggested choosing the optimal 




bandwidth by minimizing an asymptotic approximation of the second-order term. 
Mammen and Park [21] proceeded in a similar way to derive the optimal band- 
width for estimation of the efficient score function in the symmetric location prob- 
lem. These papers considered specific families of estimators and did not deal with 
second-order efficiency among all estimators. Golubev and Härdle [9, 10] studied 
partial linear models and suggested second-order efficient estimators as well as 
their nonparametrically adaptive versions. These results rely strongly on the lin- 
earity and additivity of the parametric component in partial linear models. The 
problem of how to treat second-order efficiency for essentially nonlinear models 
has remained open, and our aim here is to give a solution to this problem. 

We restrict our study to one basic model that seems to capture the main difficulties in deriving second-order efficiency, being at the same time simple enough to avoid unnecessary technicalities. Namely, we consider the estimation of a shift parameter $\theta$ based on observations

$$x^{\varepsilon}(t) = f(t - \theta) + \varepsilon n(t), \qquad t \in [-1/2, 1/2], \tag{1}$$

where $n(t)$ is the standard Gaussian white noise process on $[-1/2, 1/2]$ (cf. [16], Chapter 3) and $f(\cdot)$ is a smooth symmetric [i.e., $f(t) = f(-t)$, $\forall t$] periodic function with period 1, and $0 < \varepsilon < 1$ is a known noise parameter. With $\varepsilon = 1/\sqrt{n}$, where $n$ is an equivalent sample size, model (1) can be viewed as a "Gaussian white noise analog" of the classical symmetric location model [2, 26, 27].

If the signal $f$ is known, the maximum likelihood estimator

$$\hat\theta_{ML} = \arg\max_{\tau} \int_{-1/2}^{1/2} f(t - \tau)\, x^{\varepsilon}(t)\,dt$$

is locally asymptotically minimax (e.g., [17]). In particular, its mean square risk satisfies

$$\lim_{\varepsilon \to 0} \sup_{\theta \in \Theta} E_{\theta,f}\bigl[(\hat\theta_{ML} - \theta)^2 I^{\varepsilon}(f)\bigr] = 1, \tag{2}$$

for any sufficiently small interval $\Theta$, where

$$I^{\varepsilon}(f) = \varepsilon^{-2} \int_{-1/2}^{1/2} [f'(t)]^2\,dt$$

is the Fisher information associated with model (1) and $E_{\theta,f}$ is the expectation with respect to the distribution of the observation $X^{\varepsilon} = (x^{\varepsilon}(t),\ t \in [-1/2, 1/2])$ in model (1). The corresponding probability measure will be denoted by $P_{\theta,f}$.

In a semiparametric setup where $f$ is not known, an efficient and S-adaptive estimator of $\theta$ is suggested by Golubev [8] for a model close to (1) where the observations are available for all $t \in \mathbb{R}$ and $f$ is not periodic. Härdle and Marron [14] discussed semiparametric estimation for models with discrete observations similar to (1) involving also a scale parameter.




Here we construct an S-adaptive and second-order efficient semiparametric estimator of $\theta$ in model (1). It is of penalized maximum likelihood type with an appropriately chosen penalty. To derive this estimator, we introduce a prior on $f$ and then maximize both in $\theta$ and $f$ the posterior density of $f$ given the observations. This procedure is of a Bayesian type w.r.t. $f$ for fixed $\theta$. It can be viewed in the following way: we "eliminate" the nonparametric component using a Bayesian argument, while the final estimation of $\theta$ is realized by maximum likelihood.

We conjecture that the penalized maximum likelihood approach using similar 
arguments would be a proper tool to get second-order efficient estimators for other 
semiparametric models, and we believe that our technique of proving minimax 
lower bounds with second-order terms might be useful there as well. 

This paper is organized as follows. In Section 2 we give some heuristics con- 
cerning the first- and second-order efficiency in model (1). Section 3 contains the 
argument leading to a class of estimators defined by a sequence of weights: we 
show how these estimators (that are of penalized maximum likelihood type) are 
derived from Bayesian considerations. In Section 4 we show that, under certain as- 
sumptions on the sequence of weights, the estimators from this class are S-adaptive 
and we study their second-order asymptotics. Section 5 discusses a minimax prob- 
lem for the second-order term. In Section 6 we give a locally asymptotically min- 
imax lower bound and suggest a second-order efficient estimator obtained with a 
particular choice of weights. Sections 7—9 contain the proofs. 


2. Some heuristics. This section provides some useful heuristics about first- 
and second-order semiparametric efficiency in model (1). 

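We first explain the result (2) obtained for known $f$. An intuitive way to do this is based on a local linear approximation of the signal $f(t - \theta)$. Suppose that $\theta$ belongs to a small interval $[\theta_0 - \Delta_\varepsilon, \theta_0 + \Delta_\varepsilon]$, where $\Delta_\varepsilon > 0$ and $\theta_0$ are known and $\Delta_\varepsilon \to 0$ as $\varepsilon \to 0$. This assumption is essentially equivalent to the existence of a $\Delta_\varepsilon$-consistent estimator of $\theta$. For simplicity, we assume that $\Delta_\varepsilon \sim \varepsilon$ [for rigorous proofs one needs to take $\Delta_\varepsilon$ slightly larger than $\varepsilon$, so that $\Delta_\varepsilon/\varepsilon \to \infty$ as $\varepsilon \to 0$, e.g., $\Delta_\varepsilon = \varepsilon\sqrt{\log(\varepsilon^{-2})}$]. Then, replacing $f(t - \theta)$ in (1) by its linear approximation $f(t - \theta_0) - f'(t - \theta_0)(\theta - \theta_0)$, we get the linear model

$$x^{\varepsilon}(t) = f(t - \theta_0) - f'(t - \theta_0)(\theta - \theta_0) + \varepsilon n(t), \qquad t \in [-1/2, 1/2]. \tag{3}$$

When $f$ is known we can subtract $f(t - \theta_0)$ from these observations, thus obtaining an equivalent model,

$$y^{\varepsilon}(t) = f'(t - \theta_0)(\theta - \theta_0) + \varepsilon n(t), \qquad t \in [-1/2, 1/2].$$

Estimation of $\theta - \theta_0$ in this linear regression model is now straightforward. Multiplying the observation $y^{\varepsilon}(t)$ by $f'(t - \theta_0)$, integrating over the interval $[-1/2, 1/2]$ and dividing by $\varepsilon^2 I^{\varepsilon}(f) = \int_{-1/2}^{1/2} [f'(t)]^2\,dt$ we get the Gaussian shift model

$$Y^{\varepsilon} = \theta - \theta_0 + [I^{\varepsilon}(f)]^{-1/2}\,\xi, \tag{4}$$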




where $\xi \sim N(0, 1)$. Clearly, $Y^{\varepsilon}$ is an efficient estimator of $\theta - \theta_0$. Thus, the argument here is based on replacing the original nonlinear estimation problem by a Gaussian shift experiment. A deep theoretical background for this argument is given by Le Cam's theory of asymptotic equivalence [19].

Suppose now that $f$ is an unknown symmetric function. Then again we can use model (3) to approximate the initial model (1). But the approximating model is now nonlinear since it contains the product of unknown parameters $(\theta - \theta_0)$ and $f'(t - \theta_0)$. Fortunately, this is not a problem, and in this case one can also construct an efficient estimator.

Indeed, since $f'$ is an odd function and $f$ is an even function, projecting the observations (3) on the spaces of even and odd functions we get

$$x_e^{\varepsilon}(t) = f(t - \theta_0) + \varepsilon n_e(t), \tag{5}$$

$$x_o^{\varepsilon}(t) = f'(t - \theta_0)(\theta - \theta_0) + \varepsilon n_o(t), \tag{6}$$

where $n_o(t)$ and $n_e(t)$ are two independent Gaussian white noise processes. Based on $x_e^{\varepsilon}(t)$, we estimate the derivative $f'(t - \theta_0)$ and then plug this estimator into (6) to recover the parameter of interest from the observation $x_o^{\varepsilon}(t)$. This allows us to obtain an efficient (S-adaptive) estimator of $\theta$.

We turn now to a heuristic derivation of second-order asymptotics. In order to do that we simplify our approximate statistical model (5)-(6) assuming that $\theta_0 = 0$ and translating the observations $x_e^{\varepsilon}(t)$, $x_o^{\varepsilon}(t)$ into a sequence space.

We will suppose throughout the paper that the unknown function $f$ can be represented as

$$f(t) = \sqrt{2} \sum_{k=1}^{\infty} f_k \cos(2\pi k t), \tag{7}$$

where the Fourier series converges for all $t$ and the Fourier coefficients $f_k$ are defined by

$$f_k = \sqrt{2} \int_{-1/2}^{1/2} f(t) \cos(2\pi k t)\,dt.$$

Using this and projecting (5) and (6) on the trigonometric basis functions we obtain the sequence model

$$X_k = f_k + \varepsilon \xi_k, \qquad k = 1, 2, \dots, \tag{8}$$

$$X_k^* = \theta (2\pi k) f_k + \varepsilon \xi_k^*, \qquad k = 1, 2, \dots, \tag{9}$$

where $(\xi_k, \xi_k^*,\ k = 1, 2, \dots)$ are i.i.d. standard Gaussian random variables. The nuisance parameters $f_k$ can be estimated from (8) by well-known techniques for the Gaussian sequence model (see, e.g., [29]). In particular, it is natural to use linear estimators of $f_k$ defined by $\hat f_k = h_k X_k$, where $h_k = h_k(\varepsilon)$ are such that $\sum_{k=1}^{\infty} h_k^2 < \infty$. An example is $h_k = 1_{\{k \le N_\varepsilon\}}$ where $1_{\{\cdot\}}$ is the indicator function and $N_\varepsilon$ is an integer such that $N_\varepsilon \to \infty$ as $\varepsilon \to 0$.




Next, considering separately model (9), it is not hard to show that if $f_k$ were known the maximum likelihood (least squares) estimator

$$\hat\theta_f = \sum_{k=1}^{\infty} (2\pi k) f_k X_k^* \Big/ \sum_{k=1}^{\infty} (2\pi k)^2 f_k^2 \tag{10}$$

would be asymptotically minimax for $\theta$. At first sight, it seems natural to plug in $\hat f_k$ instead of $f_k$ in the expression for $\hat\theta_f$, thus obtaining the estimator

$$\hat\theta = \sum_{k=1}^{\infty} (2\pi k) h_k X_k X_k^* \Big/ \sum_{k=1}^{\infty} (2\pi k)^2 h_k^2 X_k^2. \tag{11}$$

However, this estimator is not optimal: it can have a very large bias. The reason is that the functional $\sum_{k=1}^{\infty} (2\pi k)^2 f_k^2$ in (10) is not estimated correctly. An improved version of $\hat\theta$ can be suggested in the form

$$\theta^* = \sum_{k=1}^{\infty} (2\pi k) h_k X_k X_k^* \Big/ \sum_{k=1}^{\infty} (2\pi k)^2 h_k (X_k^2 - \varepsilon^2). \tag{12}$$

As compared to (11), we replace $h_k^2$ by $h_k$ in the denominator and replace $X_k^2$ by the unbiased estimator $X_k^2 - \varepsilon^2$ of $f_k^2$. This turns out to improve significantly the asymptotics of the risk.
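
The difference between the plug-in estimator (11) and the corrected estimator (12) can be explored by simulating the sequence model (8)-(9). The following is a minimal sketch under stated assumptions: the series is truncated to finitely many coefficients, and the coefficients $f_k = k^{-2}$, the weights $h_k$, the seed and all function names are illustrative choices.

```python
import math
import random


def simulate_sequence_model(f, theta, eps, rng):
    """Draw (X_k, X_k^*) from (8)-(9) for the finitely many coefficients in f."""
    X = [fk + eps * rng.gauss(0.0, 1.0) for fk in f]
    X_star = [theta * 2.0 * math.pi * k * fk + eps * rng.gauss(0.0, 1.0)
              for k, fk in enumerate(f, start=1)]
    return X, X_star


def theta_plug_in(X, X_star, h):
    """Estimator (11): h_k^2 X_k^2 in the denominator."""
    num = sum(2.0 * math.pi * k * hk * xk * xs
              for k, (hk, xk, xs) in enumerate(zip(h, X, X_star), start=1))
    den = sum((2.0 * math.pi * k) ** 2 * hk ** 2 * xk ** 2
              for k, (hk, xk) in enumerate(zip(h, X), start=1))
    return num / den


def theta_star(X, X_star, h, eps):
    """Estimator (12): h_k (X_k^2 - eps^2) in the denominator."""
    num = sum(2.0 * math.pi * k * hk * xk * xs
              for k, (hk, xk, xs) in enumerate(zip(h, X, X_star), start=1))
    den = sum((2.0 * math.pi * k) ** 2 * hk * (xk ** 2 - eps ** 2)
              for k, (hk, xk) in enumerate(zip(h, X), start=1))
    return num / den


rng = random.Random(0)                       # fixed seed, for reproducibility only
f = [1.0 / k ** 2 for k in range(1, 21)]     # hypothetical Fourier coefficients
h = [1.0 / (1.0 + (k / 8.0) ** 4) for k in range(1, 21)]  # hypothetical smooth weights
X, X_star = simulate_sequence_model(f, theta=0.01, eps=0.01, rng=rng)
print(theta_plug_in(X, X_star, h), theta_star(X, X_star, h, eps=0.01))
```

In the noise-free limit $\varepsilon = 0$ the estimator (12) returns $\theta$ exactly for any weights, while (11) does so only when every $h_k$ is 0 or 1, which illustrates the bias mentioned in the text.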

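We now give a heuristic analysis of the risk of $\theta^*$. Using (8)-(9) and the notation $\|f'\|^2 = \varepsilon^2 I^{\varepsilon}(f) = \sum_{k=1}^{\infty} (2\pi k)^2 f_k^2$, we obtain

$$(\theta^* - \theta)\sqrt{I^{\varepsilon}(f)} = \|f'\|\, \frac{\chi^{\varepsilon} - T_1^{\varepsilon}}{\sum_{k=1}^{\infty} (2\pi k)^2 h_k f_k^2 + T_2^{\varepsilon}}, \tag{13}$$

where

$$\chi^{\varepsilon} = \sum_{k=1}^{\infty} (2\pi k) h_k f_k \xi_k^* + \varepsilon \sum_{k=1}^{\infty} (2\pi k) h_k \xi_k \xi_k^*,$$

$$T_1^{\varepsilon} = \theta \sum_{k=1}^{\infty} (2\pi k)^2 h_k f_k \xi_k + \theta\varepsilon \sum_{k=1}^{\infty} (2\pi k)^2 h_k (\xi_k^2 - 1),$$

$$T_2^{\varepsilon} = 2\varepsilon \sum_{k=1}^{\infty} (2\pi k)^2 h_k f_k \xi_k + \varepsilon^2 \sum_{k=1}^{\infty} (2\pi k)^2 h_k (\xi_k^2 - 1).$$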


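In order to simplify the expression in (13) we assume that $\sum_{k=1}^{\infty} (2\pi k)^4 f_k^2 < \infty$ and that $h_k$ are chosen so that $\varepsilon^2 \sum_{k=1}^{\infty} (2\pi k)^4 h_k^2 < \infty$. Under these conditions, using $|\theta| \le \Delta_\varepsilon \sim \varepsilon$, one obtains that

$$E_{\theta,f}\bigl[(T_1^{\varepsilon})^2\bigr] = O(\varepsilon^2), \qquad E_{\theta,f}\bigl[(T_2^{\varepsilon})^2\bigr] = O(\varepsilon^2).$$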




It is also straightforward to see that $E_{\theta,f}(\chi^{\varepsilon} T_1^{\varepsilon}) = 0$, and to show, with some easy algebra, that

$$E_{\theta,f}\bigl[(\chi^{\varepsilon})^2 T_2^{\varepsilon}\bigr] = 4\varepsilon^2 \sum_{k=1}^{\infty} h_k^3 (2\pi k)^4 f_k^2 + 2\varepsilon^4 \sum_{k=1}^{\infty} h_k^3 (2\pi k)^4 = O(\varepsilon^2).$$

Next note that we are allowed to drop the terms of order $O(\varepsilon^2)$ since their contribution in the risk (asymptotically, in the mean absolute value) is smaller than the final second-order asymptotics that we are going to obtain. Up to these terms, we get from (13)


ms 
: LF’ —."—9À —— 2 j | 
0* —0)./ I* A ———— -I —¥°T 2zk)*h f 
(0* — 0) J I*(f) ENDE a*l Psy p Donk) ity 
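and thus

$$E_{\theta,f}\bigl[(\theta^* - \theta)^2 I^{\varepsilon}(f)\bigr] \approx \|f'\|^2 \Bigl(\sum_{k=1}^{\infty} (2\pi k)^2 h_k f_k^2\Bigr)^{-2} E_{\theta,f}\bigl[(\chi^{\varepsilon})^2\bigr] = \|f'\|^2 \sum_{k=1}^{\infty} (2\pi k)^2 h_k^2 (\varepsilon^2 + f_k^2) \Bigl(\sum_{k=1}^{\infty} (2\pi k)^2 h_k f_k^2\Bigr)^{-2}.$$

This expression can be simplified if we assume that $0 \le h_k \le 1$ and

$$\Bigl[\sum_{k=1}^{\infty} (1 - h_k)(2\pi k)^2 f_k^2\Bigr]^2 = o\Bigl(\sum_{k=1}^{\infty} (1 - h_k)^2 (2\pi k)^2 f_k^2\Bigr) \tag{14}$$

as $\varepsilon \to 0$. Then, in particular, $\sum_{k=1}^{\infty} (1 - h_k)(2\pi k)^2 f_k^2 = o(1)$, and one obtains

$$\Bigl(\sum_{k=1}^{\infty} (2\pi k)^2 h_k f_k^2\Bigr)^{-2} = \Bigl(\|f'\|^2 + \sum_{k=1}^{\infty} (2\pi k)^2 (h_k - 1) f_k^2\Bigr)^{-2} \approx \|f'\|^{-4}\Bigl\{1 - 2\|f'\|^{-2} \sum_{k=1}^{\infty} (2\pi k)^2 (h_k - 1) f_k^2\Bigr\}.$$

Using this and (14) we derive the following expansion for the risk:

$$E_{\theta,f}\bigl[(\theta^* - \theta)^2 I^{\varepsilon}(f)\bigr] \approx \|f'\|^{-2}\Bigl(\|f'\|^2 + \sum_{k=1}^{\infty} (2\pi k)^2 \bigl[\varepsilon^2 h_k^2 + (h_k^2 - 1) f_k^2\bigr]\Bigr) \times \Bigl[1 - 2\|f'\|^{-2} \sum_{k=1}^{\infty} (2\pi k)^2 (h_k - 1) f_k^2\Bigr] \approx 1 + \|f'\|^{-2} R^{\varepsilon}[f, h], \tag{15}$$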




where

$$R^{\varepsilon}[f, h] = \sum_{k=1}^{\infty} (2\pi k)^2 \bigl[(1 - h_k)^2 f_k^2 + \varepsilon^2 h_k^2\bigr]. \tag{16}$$

The second-order term in (15), that is, the functional $\|f'\|^{-2} R^{\varepsilon}[f, h]$, has a clear statistical meaning. Suppose that we know $\theta$ and we want to estimate the derivative $f'(t - \theta)$ based on observations (1). To measure the quality of an estimator $\hat f'(t - \theta)$ we choose the relative mean integrated squared error,

$$\mathrm{Err}(\hat f', f) = \frac{1}{\|f'\|^2}\, E_{\theta,f} \int_{-1/2}^{1/2} \bigl[\hat f'(t - \theta) - f'(t - \theta)\bigr]^2\,dt.$$

Consider a linear estimator

$$\hat f'(t - \theta) = -2 \sum_{k=1}^{\infty} h_k (2\pi k) \sin[2\pi k(t - \theta)] \int_{-1/2}^{1/2} \cos[2\pi k(s - \theta)]\, x^{\varepsilon}(s)\,ds.$$

Using (7), it is easy to show that $\mathrm{Err}(\hat f', f) = \|f'\|^{-2} R^{\varepsilon}[f, h]$. Thus, the expression $\|f'\|^{-2} R^{\varepsilon}[f, h]$ is a relative mean integrated squared error for nonparametric estimation of the derivative of $f$ in the Gaussian white noise model. We see that the second-order expansion (15) relates two statistical problems: semiparametric estimation of $\theta$ and nonparametric estimation in $L_2$-norm of the Fisher informant $f'(t - \theta)$. It also reveals a presumably general fact that second-order asymptotic terms in semiparametric problems account for the mean integrated squared error of recovering the Fisher informant.
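
The functional (16) is a standard bias-variance sum, so for a given family of weights the smoothing parameter can be chosen by minimizing it. The sketch below evaluates $R^{\varepsilon}[f, h]$ for a truncated coefficient sequence and searches for the best cutoff among projection weights $h_k = 1_{\{k \le N\}}$. The function names and the coefficients are illustrative assumptions, and in practice $f$ is unknown, so this is an oracle computation.

```python
import math


def second_order_term(f, h, eps):
    """R^eps[f, h] = sum_k (2 pi k)^2 * [(1 - h_k)^2 f_k^2 + eps^2 h_k^2], truncated to len(f) terms."""
    return sum((2.0 * math.pi * k) ** 2 * ((1.0 - hk) ** 2 * fk ** 2 + eps ** 2 * hk ** 2)
               for k, (fk, hk) in enumerate(zip(f, h), start=1))


def best_projection_cutoff(f, eps):
    """Oracle cutoff N in {0, ..., len(f)} minimizing R^eps for h_k = 1{k <= N}."""
    K = len(f)
    risks = []
    for N in range(K + 1):
        h = [1.0 if k <= N else 0.0 for k in range(1, K + 1)]
        risks.append(second_order_term(f, h, eps))
    return min(range(K + 1), key=risks.__getitem__)


f = [1.0 / k ** 3 for k in range(1, 51)]   # hypothetical coefficients of a smooth f
for eps in (0.1, 0.01, 0.001):
    print(eps, best_projection_cutoff(f, eps))
```

As $\varepsilon$ decreases the oracle cutoff grows, in line with the requirement $N_\varepsilon \to \infty$ for the projection weights introduced after (9).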


3. Penalized maximum likelihood estimator. In Section 2 we have sketched second-order asymptotics for the estimator $\theta^*$ in model (8)-(9), which is only a local approximation of the original model (1) in a neighborhood of $\theta_0 = 0$. Thus, $\theta^*$ is not directly applicable for model (1). Of course, the procedure can be corrected: instead of replacing $\theta_0$ by 0, one should replace it by a preliminary $\varepsilon$-consistent estimator of $\theta$. This would lead to a two-stage estimation procedure that would presumably have the desired second-order behavior under some conditions. There exists, however, a direct and more elegant estimator achieving the same result. This estimator is inspired by the Bayes argument that we are going to describe now.

Given model (1), we have at our disposal the following series of discrete observations:

$$x_k = f_k \cos(2\pi k\theta) + \varepsilon \xi_k, \qquad x_k^* = f_k \sin(2\pi k\theta) + \varepsilon \xi_k^*, \qquad k = 1, 2, \dots. \tag{17}$$

Here $(\xi_k, \xi_k^*,\ k = 1, 2, \dots)$ are i.i.d. standard Gaussian random variables,

$$x_k = \sqrt{2} \int_{-1/2}^{1/2} x^{\varepsilon}(t) \cos(2\pi k t)\,dt, \qquad x_k^* = \sqrt{2} \int_{-1/2}^{1/2} x^{\varepsilon}(t) \sin(2\pi k t)\,dt,$$

and (17) is obtained by projection of (1) on the trigonometric basis functions on $[-1/2, 1/2]$ using (7).

Our aim is to define a suitable estimator of $\theta$ using these observations. A general idea is to "eliminate" first the nonparametric component of the model represented by the sequence of Fourier coefficients $f_k$ (which we consider to be nuisance parameters). We will proceed as follows. Assume for a moment that the $f_k$'s are independent zero-mean Gaussian random variables with variances $\sigma_k^2$. Assume also that they are independent of the noise sequence $\{\xi_k, \xi_k^*\}$. We will replace the sequence $\{f_k\}$ by the most probable, with respect to the posterior distribution of $\{f_k\}$ given $\{x_k, x_k^*\}$, sequence $\{f_k^*\}$. Clearly, this sequence will depend only on $\{x_k, x_k^*\}$ and $\theta$, and thus $\{f_k\}$ will be eliminated. The final step will be to maximize over $\theta$ the remaining likelihood, thus obtaining an estimator of $\theta$.

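To define the procedure formally, note that the problem factorizes: it is sufficient to find the $f_k^*$'s for a fixed $k$, since the triples $(x_k, x_k^*, f_k)$ with different $k$ are independent. Maximizing over $f_k$ the posterior density of $f_k$ given $x_k, x_k^*$ is equivalent to maximizing the joint density of $x_k, x_k^*, f_k$, which equals

$$p_\theta(x_k, x_k^*, f_k) = \frac{1}{(2\pi)^{3/2} \varepsilon^2 \sigma_k} \exp\Bigl(-\frac{f_k^2}{2\sigma_k^2}\Bigr) \exp\Bigl[-\frac{(x_k - f_k \cos(2\pi k\theta))^2 + (x_k^* - f_k \sin(2\pi k\theta))^2}{2\varepsilon^2}\Bigr]$$

$$= A(x_k, x_k^*) \exp\Bigl[\frac{\sqrt{2} f_k}{\varepsilon^2} \int_{-1/2}^{1/2} \cos[2\pi k(t - \theta)]\, x^{\varepsilon}(t)\,dt - \frac{f_k^2 (\sigma_k^2 + \varepsilon^2)}{2\varepsilon^2 \sigma_k^2}\Bigr],$$

where $A(x_k, x_k^*)$ does not depend on $f_k$ and $\theta$. The maximizer of $p_\theta(x_k, x_k^*, f_k)$ over $f_k$ has the form

$$f_k^*(\theta) = \sqrt{2}\,\lambda_k \int_{-1/2}^{1/2} \cos[2\pi k(t - \theta)]\, x^{\varepsilon}(t)\,dt,$$

where $\lambda_k = \dfrac{\sigma_k^2}{\sigma_k^2 + \varepsilon^2}$ and

$$\max_{f_k} p_\theta(x_k, x_k^*, f_k) = p_\theta(x_k, x_k^*, f_k^*(\theta)) = A(x_k, x_k^*) \exp\Bigl[\frac{\lambda_k}{\varepsilon^2} \Bigl(\int_{-1/2}^{1/2} \cos[2\pi k(t - \theta)]\, x^{\varepsilon}(t)\,dt\Bigr)^2\Bigr]. \tag{18}$$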

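Set

$$\hat\theta_{PML} = \arg\max_{\theta \in \Theta} \prod_{k=1}^{\infty} p_\theta(x_k, x_k^*, f_k^*(\theta)) = \arg\max_{\theta \in \Theta} \max_{\{f_k\}} \prod_{k=1}^{\infty} p_\theta(x_k, x_k^*, f_k),$$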




where $\Theta$ is a parameter set associated with the model. Thus, $\hat\theta_{PML}$ is the $\theta$-component of the overall maximum likelihood estimator corresponding to the infinite product density $\prod_{k=1}^{\infty} p_\theta(x_k, x_k^*, f_k)$. In view of (18), we may write this estimator as

$$\hat\theta_{PML} = \arg\max_{\tau \in \Theta} \Bigl\{\sum_{k=1}^{\infty} \lambda_k \Bigl(\int_{-1/2}^{1/2} \cos[2\pi k(t - \tau)]\, x^{\varepsilon}(t)\,dt\Bigr)^2\Bigr\} \tag{19}$$


$$= \arg\max_{\tau \in \Theta} \max_{\{g_k\}} \sum_{k=1}^{\infty} \Bigl[\frac{\sqrt{2}\, g_k}{\varepsilon^2} \int_{-1/2}^{1/2} \cos[2\pi k(t - \tau)]\, x^{\varepsilon}(t)\,dt - g_k^2 \Bigl(\frac{1}{2\varepsilon^2} + \frac{1}{2\sigma_k^2}\Bigr)\Bigr], \tag{20}$$

where $\max_{\{g_k\}}$ denotes the maximum over sequences $\{g_k\}$ belonging to a subset of $\ell_2$, and we suppose that $f$ satisfies conditions such that the infinite sums converge almost surely. We will call $\hat\theta_{PML}$ a penalized maximum likelihood estimator (PMLE), although this is not a PMLE in the usual sense. Comparing $\hat\theta_{PML}$ with the maximum likelihood estimator $\hat\theta_{ML}$ we see that $\hat\theta_{PML}$ can be interpreted as a penalized version of $\hat\theta_{ML}$ corresponding to a function $f(\cdot) = f_\tau(\cdot)$ whose Fourier coefficients are the maximizers $\{g_k^*(\tau)\}$ of the term in square brackets in (20) over $\{g_k\}$ for fixed $\tau$ and to the penalty $\sum_{k=1}^{\infty} (g_k^*(\tau))^2 \bigl(\frac{1}{2\varepsilon^2} + \frac{1}{2\sigma_k^2}\bigr)$ (up to a multiplicative constant, cf. definition of $\hat\theta_{ML}$). Thus, the difference of $\hat\theta_{PML}$ from the "pure" PMLE is in the fact that $f(\cdot) = f_\tau(\cdot)$ is not fixed and known: it depends on the parameter $\tau$ over which the maximization is carried out.

To make the estimator $\hat\theta_{PML}$ feasible, it is natural to consider only finite sums
in (19), (20), including the terms with $k \le N_\varepsilon$, for some $N_\varepsilon$ that depends on $\varepsilon$
and tends to $\infty$ as $\varepsilon \to 0$. In particular, this will be the case for the second-order
minimax estimator that we derive below.

Note that the estimator (12) defined in Section 2 is nothing but a local version
of the estimator (19) in a neighborhood of $\theta_0 = 0$. In fact, differentiating formally
the expression in curly brackets in (19) we obtain that $\hat\theta_{PML}$ is a solution of the
equation

$$
\sum_{k=1}^{\infty}\lambda_k(2\pi k)\left(\sqrt{2}\int_{-1/2}^{1/2}\cos[2\pi k(t-\tau)]\,x^\varepsilon(t)\,dt\right)\left(\sqrt{2}\int_{-1/2}^{1/2}\sin[2\pi k(t-\tau)]\,x^\varepsilon(t)\,dt\right) = 0. \tag{21}
$$

The two factors in parentheses in (21) are equal to $y_k = x_k\cos(2\pi k\tau) + x_k^*\sin(2\pi k\tau)$ and
$y_k^* = x_k^*\cos(2\pi k\tau) - x_k\sin(2\pi k\tau)$, respectively, allowing one to reduce (21) to

$$
\sum_{k=1}^{\infty}\lambda_k(2\pi k)\Big(x_k x_k^*\cos(4\pi k\tau) - \big[x_k^2 - (x_k^*)^2\big]\sin(4\pi k\tau)/2\Big) = 0.
$$


SEMIPARAMETRIC SHIFT ESTIMATION 179 


Linearizing this equation in the vicinity of $\tau = 0$, we get the following approximate
formula for a solution of (21):

$$
\hat\theta_{PML} \approx \sum_{k=1}^{\infty}\lambda_k(2\pi k)\,x_k x_k^* \Big/ \sum_{k=1}^{\infty}\lambda_k(2\pi k)^2\big[x_k^2 - (x_k^*)^2\big]. \tag{22}
$$

It can be shown, using the argument from Section 2, that (22) is asymptotically
analogous to the estimator $\theta^*$ given by (12) with $h_k = \lambda_k$. One difference is that
here we have $x_k, x_k^*$ instead of $X_k, X_k^*$, but $x_k \approx X_k$ and $x_k^* \approx X_k^*$ for $\theta$ close
to 0. Another point is that these estimators have somewhat different denominators.
However, for small $\theta$ both denominators estimate the same quadratic functional
$\sum_{k=1}^{\infty}(2\pi k)^2 f_k^2$ and one can show that they are quite close to each other, so that
their difference does not appear in the second-order asymptotics of the risk.
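
The reduction of the score equation (21) and the linearized formula (22) can be checked numerically. The following is a minimal sketch, assuming the sequence-model form $x_k = f_k\cos(2\pi k\theta) + \varepsilon\xi_k$, $x_k^* = f_k\sin(2\pi k\theta) + \varepsilon\xi_k^*$; the function names and the noiseless test data are illustrative choices, not part of the paper.

```python
import math

def rotated_pair(xk, xsk, k, tau):
    """Return (y_k, y_k^*), the two factors of (21) for a candidate shift tau."""
    w = 2.0 * math.pi * k
    y = xk * math.cos(w * tau) + xsk * math.sin(w * tau)
    ys = xsk * math.cos(w * tau) - xk * math.sin(w * tau)
    return y, ys

def score(tau, x, xs, lam):
    """Left-hand side of the reduced form of (21); x[0] holds x_1, and so on."""
    total = 0.0
    for k, (xk, xsk, lk) in enumerate(zip(x, xs, lam), start=1):
        w = 2.0 * math.pi * k
        total += lk * w * (xk * xsk * math.cos(2.0 * w * tau)
                           - (xk ** 2 - xsk ** 2) * math.sin(2.0 * w * tau) / 2.0)
    return total

def linearized_estimate(x, xs, lam):
    """Approximate root (22), obtained by linearizing the score at tau = 0."""
    num = sum(lk * (2.0 * math.pi * k) * xk * xsk
              for k, (xk, xsk, lk) in enumerate(zip(x, xs, lam), start=1))
    den = sum(lk * (2.0 * math.pi * k) ** 2 * (xk ** 2 - xsk ** 2)
              for k, (xk, xsk, lk) in enumerate(zip(x, xs, lam), start=1))
    return num / den

def noiseless_data(f, theta):
    """Observations with epsilon = 0 (hypothetical test data, illustration only)."""
    x = [fk * math.cos(2.0 * math.pi * k * theta) for k, fk in enumerate(f, start=1)]
    xs = [fk * math.sin(2.0 * math.pi * k * theta) for k, fk in enumerate(f, start=1)]
    return x, xs
```

On noiseless data the score vanishes at the true shift, while (22) recovers the shift only approximately, since it is a first-order linearization around 0.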


4. Second-order asymptotics of the estimators. In this section we consider 
the class of estimators defined by 


$$
\hat\theta_{AD} = \arg\max_{\tau\in\Theta}\sum_{k=1}^{\infty}h_k\left(\sqrt{2}\int_{-1/2}^{1/2}\cos[2\pi k(t-\tau)]\,x^\varepsilon(t)\,dt\right)^2, \tag{23}
$$

where $\{h_k\}$ is a sequence of real numbers satisfying some general conditions. For
a particular choice $h_k = \lambda_k$ the estimator $\hat\theta_{AD}$ is equal to the penalized maximum
likelihood estimator (19) obtained from a Bayesian argument with $\lambda_k = \sigma_k^2/(\sigma_k^2 + \varepsilon^2)$,
but we also allow other weights $h_k$. In particular, the weights $\{h_k\}$ such that
$h_k = 1$ for some initial values of $k$ play an important role in our further argument,
while we always have $\lambda_k < 1$ for $\hat\theta_{PML}$.

We will show that under some assumptions on $\{h_k\}$ the estimator $\hat\theta_{AD}$ is
S-adaptive and we will give explicit second-order asymptotics for the risk of $\hat\theta_{AD}$.
In what follows we will suppose that $h_k \ne 0$ for only a finite (typically, depending
on $\varepsilon$ and growing to $\infty$ as $\varepsilon \to 0$) number of integers $k$. This assumption is natural,
since otherwise the estimator $\hat\theta_{AD}$ is not feasible. In order not to specify the
set where $h_k \ne 0$ we keep in the notation the sums over all integers $k$.
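
To make the definition (23) concrete, here is a minimal sketch that maximizes the contrast over a grid, assuming the data are given in sequence form, so that $\sqrt{2}\int\cos[2\pi k(t-\tau)]x^\varepsilon(t)\,dt = x_k\cos(2\pi k\tau) + x_k^*\sin(2\pi k\tau)$. The grid search, the grid size and all function names are illustrative assumptions; the paper does not prescribe a numerical method.

```python
import math
import random

def simulate(f, theta, eps, rng):
    """Draw (x_k, x_k^*) from the sequence model with shift theta and noise level eps."""
    x, xs = [], []
    for k, fk in enumerate(f, start=1):
        w = 2.0 * math.pi * k
        x.append(fk * math.cos(w * theta) + eps * rng.gauss(0.0, 1.0))
        xs.append(fk * math.sin(w * theta) + eps * rng.gauss(0.0, 1.0))
    return x, xs

def contrast(tau, x, xs, h):
    """Weighted contrast of (23) evaluated at the candidate shift tau."""
    total = 0.0
    for k, (xk, xsk, hk) in enumerate(zip(x, xs, h), start=1):
        w = 2.0 * math.pi * k
        total += hk * (xk * math.cos(w * tau) + xsk * math.sin(w * tau)) ** 2
    return total

def estimate_shift(x, xs, h, tau0, grid_size=2001):
    """Grid maximizer of the contrast over [-tau0, tau0]."""
    best_tau, best_val = -tau0, -math.inf
    for i in range(grid_size):
        tau = -tau0 + 2.0 * tau0 * i / (grid_size - 1)
        val = contrast(tau, x, xs, h)
        if val > best_val:
            best_tau, best_val = tau, val
    return best_tau

# Example call with hypothetical coefficients f_k and projection-type weights.
rng = random.Random(0)
x, xs = simulate([1.0, 0.5, 0.25], 0.1, 0.01, rng)
theta_hat = estimate_shift(x, xs, [1.0, 1.0, 1.0], tau0=0.2)
```

A grid is used only because it is simple and cannot get trapped in a local maximum; its resolution limits the attainable accuracy, so a local refinement step would be needed for small $\varepsilon$.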

We first define the parametric set $\Theta$ where $\theta$ lies. Since $f$ is symmetric and
periodic with period 1, we get that $s(t) = f(1/2 - t)$ is also symmetric and periodic
with period 1. Hence, the observations $x^\varepsilon(t)$ corresponding to parameters
$(\theta, f(\cdot))$ and $(\theta - 1/2, s(\cdot))$ have the same probability distribution. So we cannot
discriminate between values $\theta, \theta + 1/2, \theta + 1, \ldots$ in model (1) if we suppose
that $f$ belongs to the class of symmetric and periodic functions with period 1. In
order that the model be identifiable, $\Theta$ should be strictly included in an interval of
length 1/2. For definiteness, we assume the following.


ASSUMPTION A1. $\Theta = \{\theta : |\theta| < \tau_0\}$ where $0 < \tau_0 < 1/4$.
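
The identifiability argument above can be illustrated numerically. The sketch below assumes the cosine expansion $f(t) = \sqrt{2}\sum_k f_k\cos(2\pi k t)$ for a symmetric 1-periodic signal (the exact normalization of expansion (7) is an assumption here) and checks that the pairs $(\theta, f)$ and $(\theta - 1/2, s)$ produce the same shifted signal.

```python
import math

def f_value(t, coeffs):
    """Symmetric 1-periodic signal with cosine coefficients coeffs[0] = f_1, ..."""
    return math.sqrt(2.0) * sum(fk * math.cos(2.0 * math.pi * k * t)
                                for k, fk in enumerate(coeffs, start=1))

def s_value(t, coeffs):
    """The companion signal s(t) = f(1/2 - t)."""
    return f_value(0.5 - t, coeffs)

def max_signal_gap(coeffs, theta, n=200):
    """Largest difference between f(t - theta) and s(t - (theta - 1/2)) on a grid of t."""
    gap = 0.0
    for i in range(n):
        t = -0.5 + i / n
        a = f_value(t - theta, coeffs)
        b = s_value(t - (theta - 0.5), coeffs)
        gap = max(gap, abs(a - b))
    return gap
```

The gap is zero up to rounding, which is why $\Theta$ has to be confined to an interval shorter than 1/2.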


Next, we define the class of functions $\mathcal{F}$ where $f$ lies. Let $\rho$ and $C_0$ be positive
constants. Denote by $\mathcal{F} = \mathcal{F}(\rho, C_0)$ the class of all functions $f : [-1/2, 1/2] \to \mathbb{R}$
that admit the Fourier expansion (7) with coefficients $f_k$ satisfying the following
assumptions.

ASSUMPTION A2. $f_1^2 \ge \rho$.

ASSUMPTION A3. $\|f''\|^2 \le C_0$.


Here and in the sequel, for a sequence of real numbers $\{a_k\}$, we use the notation

$$
\|a\|^2 = \sum_{k=1}^{\infty}a_k^2, \qquad \|a'\|^2 = \sum_{k=1}^{\infty}a_k^2(2\pi k)^2, \qquad \|a''\|^2 = \sum_{k=1}^{\infty}a_k^2(2\pi k)^4.
$$

Assumptions A2 and A3 imply that

$$
C_0 \ge \|f'\|^2 \ge (2\pi)^2\rho \qquad \forall f \in \mathcal{F}. \tag{24}
$$

Furthermore, we impose some conditions on the weight sequence $\{h_k\}$, assuming
that it depends on $\varepsilon$.


ASSUMPTION B. The weight sequence $\{h_k\}$ is such that $h_1 = 1$, $0 \le h_k \le 1$
for all $k$, and


B1. $\|h'\| \ge \rho_1\log^2(\varepsilon^{-2})\max_k h_k(2\pi k)$, where $\rho_1 > 0$ is a constant that does not
depend on $\varepsilon$,
B2. $\varepsilon^2\sum_{k=1}^{\infty}h_k(2\pi k)^4 \le C_1$, where $C_1$ is a constant that does not depend on $\varepsilon$.


We remark that the condition $0 \le h_k \le 1$ here is quite natural: if $h_k \notin [0,1]$,
projecting $h_k$ on $[0,1]$ only improves the second-order asymptotics (cf. the expression
for $R^\varepsilon[f,h]$ in (16)). Note also that Assumption B2 and the fact that
$0 \le h_k \le 1$ imply the finiteness of $\|h'\|$ for any $\varepsilon$. Assumptions B1 and B2 are not
very restrictive. For example, consider the projection weights $h_k = 1\{k \le N_\varepsilon\}$, where
$N_\varepsilon$ is an integer such that $N_\varepsilon \to \infty$ as $\varepsilon \to 0$. Then Assumption B1 is equivalent
to $N_\varepsilon \ge C\log^4(\varepsilon^{-2})$ for some constant $C > 0$, and Assumption B2 is satisfied if
$N_\varepsilon = O(\varepsilon^{-2/5})$ as $\varepsilon \to 0$.
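
For projection weights the two quantities that Assumptions B1 and B2 control are easy to evaluate. The sketch below computes the ratio $\|h'\|/\max_k h_k(2\pi k)$ that B1 compares with $\rho_1\log^2(\varepsilon^{-2})$ and the sum $\varepsilon^2\sum_k h_k(2\pi k)^4$ that B2 bounds; B2 is taken here with $h_k$ to the first power, which for 0/1 weights coincides with $h_k^2$. The function names are hypothetical.

```python
import math

def projection_weights(n_eps):
    """Weights h_k = 1 for k <= n_eps (and 0 beyond, which is simply omitted)."""
    return [1.0] * n_eps

def b1_ratio(h):
    """||h'|| / max_k h_k (2 pi k); B1 asks this to exceed rho_1 * log^2(eps^-2)."""
    norm = math.sqrt(sum((hk * 2.0 * math.pi * k) ** 2 for k, hk in enumerate(h, start=1)))
    largest = max(hk * 2.0 * math.pi * k for k, hk in enumerate(h, start=1))
    return norm / largest

def b2_value(h, eps):
    """eps^2 * sum_k h_k (2 pi k)^4; B2 asks this to stay below a constant C_1."""
    return eps ** 2 * sum(hk * (2.0 * math.pi * k) ** 4 for k, hk in enumerate(h, start=1))
```

The ratio in B1 grows like $\sqrt{N_\varepsilon/3}$ and the sum in B2 like $\varepsilon^2 N_\varepsilon^5$, which is where the two stated ranges for $N_\varepsilon$ come from.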

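Finally, we will need the following assumption involving both $f$ and $\{h_k\}$.


ASSUMPTION C. The weight sequence $\{h_k\}$ is such that, uniformly in $f \in \mathcal{F}$,

$$
\left(\sum_{k=1}^{\infty}(1-h_k)(2\pi k)^2 f_k^2\right)^2 = o\left(\sum_{k=1}^{\infty}(1-h_k)^2(2\pi k)^2 f_k^2\right) \qquad \text{as } \varepsilon \to 0.
$$

Note that, again, Assumption C is quite mild. For the projection weights $h_k = 1\{k \le N_\varepsilon\}$
it means that $\sum_{k > N_\varepsilon}(2\pi k)^2 f_k^2 \to 0$ as $\varepsilon \to 0$, uniformly in $f \in \mathcal{F}$, which
is true due to Assumption A3.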




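THEOREM 1. Let Assumptions A1–A3, B and C be satisfied. Then, uniformly
in $f \in \mathcal{F}$ and in $\theta \in \Theta$,

$$
E_{\theta,f}\big[(\hat\theta_{AD} - \theta)^2 I^\varepsilon(f)\big] = 1 + (1 + o(1))\frac{R^\varepsilon[f,h]}{\|f'\|^2} \qquad \text{as } \varepsilon \to 0,
$$

where the functional $R^\varepsilon[f,h]$ is defined in (16).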





The proof of Theorem 1 is given in Section 7.
Assumptions A3, B and C imply that

$$
\sup_{f\in\mathcal{F}} R^\varepsilon[f,h] = o(1) \qquad \text{as } \varepsilon \to 0.
$$

In fact, it follows from Assumptions B1 and B2 that $\varepsilon^2\sum_{k=1}^{\infty}h_k^2(2\pi k)^2 = o(1)$,
while Assumptions A3 and C yield $\sum_{k=1}^{\infty}(1-h_k)^2(2\pi k)^2 f_k^2 = o(1)$ as $\varepsilon \to 0$.
Thus, Theorem 1 shows that $\hat\theta_{AD}$ has the same first-order asymptotics as the efficient
estimator $\hat\theta_{ML}$ [cf. (2)], that is, $\hat\theta_{AD}$ is S-adaptive under the assumptions
of Theorem 1. But Theorem 1 says more than that, because it also provides an
asymptotically exact second-order expansion for the risk of $\hat\theta_{AD}$.
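
The second-order term is governed by the functional $R^\varepsilon[f,h]$. Its definition (16) lies outside this excerpt; the sketch below assumes the form $R^\varepsilon[f,h] = \sum_k (2\pi k)^2\big[(1-h_k)^2 f_k^2 + \varepsilon^2 h_k^2\big]$, which is the expression that the expansion in the proof of Lemma 7 produces. Function names are illustrative.

```python
import math

def risk_functional(f, h, eps):
    """Assumed form of R^eps[f, h]: squared-bias part plus variance part, weighted by (2 pi k)^2."""
    total = 0.0
    for k, (fk, hk) in enumerate(zip(f, h), start=1):
        w2 = (2.0 * math.pi * k) ** 2
        total += w2 * ((1.0 - hk) ** 2 * fk ** 2 + eps ** 2 * hk ** 2)
    return total

def projection_risk(f, n_eps, eps):
    """R^eps[f, h] for projection weights h_k = 1{k <= n_eps}."""
    h = [1.0 if k <= n_eps else 0.0 for k in range(1, len(f) + 1)]
    return risk_functional(f, h, eps)
```

With projection weights the first part is the tail $\sum_{k>N_\varepsilon}(2\pi k)^2 f_k^2$ and the second is $\varepsilon^2\sum_{k\le N_\varepsilon}(2\pi k)^2$, so the choice of $N_\varepsilon$ trades one against the other.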


5. Minimax problem for second-order term. It follows from Theorem 1 
that the second-order term of the risk of $\hat\theta_{AD}$ depends on the coefficients $\{h_k\}$ only
via the functional $R^\varepsilon[f,h]$. We would like to make this term as small as possible
by minimizing it over $h_k$. Since we do not know the nuisance parameters $f_k$ we
consider a minimax setting: we look for $h = \{h_k\}$ that minimizes the maximum of
the functional $R^\varepsilon[f,h]$ over a suitably chosen set of sequences $\{f_k\}$. Namely, we
consider a Sobolev ball

$$
W(\beta, L) = \left\{f : \sum_{k=1}^{\infty}(2\pi k)^{2\beta} f_k^2 \le L\right\},
$$

where $\beta > 1$ and $L > 0$ are given constants. A minimax sequence of weights $q = (q_k) \in \ell_2$
is defined by

$$
\sup_{f\in W(\beta,L)} R^\varepsilon[f,q] = \inf_{h\in\ell_2}\ \sup_{f\in W(\beta,L)} R^\varepsilon[f,h].
$$

It is well known (see, e.g., [1] or [23]) that such a sequence $q$ exists and it has the
form

$$
q_k = \left[1 - \left(\frac{k}{W_\varepsilon}\right)^{\beta-1}\right]_+, \tag{25}
$$

where $x_+ = \max(x, 0)$ and $W_\varepsilon$ is a solution of the equation

$$
\varepsilon^2\sum_{k=1}^{\infty}\left[\left(\frac{W_\varepsilon}{k}\right)^{\beta-1} - 1\right]_+(2\pi k)^{2\beta} = L. \tag{26}
$$



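As $\varepsilon \to 0$, we have

$$
W_\varepsilon = \left(\frac{L(\beta+2)(2\beta+1)}{(2\pi)^{2\beta}(\beta-1)\,\varepsilon^2}\right)^{1/(2\beta+1)}(1 + o(1)). \tag{27}
$$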
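
Equation (26) has no closed-form solution for finite $\varepsilon$, but its left-hand side is continuous and increasing in $W_\varepsilon$, so it can be solved numerically. The following sketch uses bisection (an illustrative choice, not taken from the paper) and compares the root with the leading term of (27).

```python
import math

def lhs_26(w, eps, beta):
    """Left-hand side of (26); only k <= w contribute because of the positive part."""
    total = 0.0
    for k in range(1, int(math.floor(w)) + 1):
        total += ((w / k) ** (beta - 1.0) - 1.0) * (2.0 * math.pi * k) ** (2.0 * beta)
    return eps ** 2 * total

def solve_w(eps, beta, big_l, tol=1e-10):
    """Bisection for the root W of lhs_26(W) = L."""
    lo, hi = 1.0, 2.0
    while lhs_26(hi, eps, beta) < big_l:
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if lhs_26(mid, eps, beta) < big_l:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def w_asymptotic(eps, beta, big_l):
    """Leading term of (27)."""
    num = big_l * (beta + 2.0) * (2.0 * beta + 1.0)
    den = (2.0 * math.pi) ** (2.0 * beta) * (beta - 1.0) * eps ** 2
    return (num / den) ** (1.0 / (2.0 * beta + 1.0))
```

For small $\varepsilon$ the numerical root and the asymptotic expression agree closely, because (27) is obtained by replacing the sum in (26) with an integral.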


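Moreover, the functional $R^\varepsilon[f,h]$ has a saddle point on $W(\beta, L) \times \ell_2$ (cf. [1],
[23] or [29], Chapter 3) with components $s, q$, where $s = \{s_k\}$ is any sequence
satisfying

$$
s_k^2 = \varepsilon^2\frac{q_k}{1 - q_k} = \varepsilon^2\left[\left(\frac{W_\varepsilon}{k}\right)^{\beta-1} - 1\right]_+. \tag{28}
$$

The existence of a saddle point at $(s, q)$ means that

$$
\inf_{h\in\ell_2}\ \sup_{f\in W(\beta,L)} R^\varepsilon[f,h] = \sup_{f\in W(\beta,L)}\ \inf_{h\in\ell_2} R^\varepsilon[f,h] = R^\varepsilon[s,q].
$$

Using (25), (26) and (28), the value $R^\varepsilon[s,q]$ can be expressed explicitly, which
yields

$$
\inf_{h\in\ell_2}\ \sup_{f\in W(\beta,L)} R^\varepsilon[f,h] = \sup_{f\in W(\beta,L)} R^\varepsilon[f,q] = \varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2 q_k =: r^\varepsilon. \tag{29}
$$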
Note finally that, as $\varepsilon \to 0$,

$$
r^\varepsilon = \frac{(2\pi)^2(\beta-1)}{3(\beta+2)}\,\varepsilon^2 W_\varepsilon^3\,(1 + o(1))
= C^*(\beta, L)\,\varepsilon^{(4\beta-4)/(2\beta+1)}(1 + o(1)), \tag{30}
$$

where

$$
C^*(\beta, L) = \frac{1}{3}\big(L(2\beta+1)\big)^{3/(2\beta+1)}\left(\frac{\beta-1}{2\pi(\beta+2)}\right)^{(2\beta-2)/(2\beta+1)}.
$$

The rate $\varepsilon^{(4\beta-4)/(2\beta+1)}$ in (30) characterizes the ratio of second-order terms to
first-order terms in the asymptotic expansion for the nonnormalized risk
$E_{\theta,f}[(\hat\theta_{AD} - \theta)^2]$. This ratio is not dramatically small for $\beta$ not too large; for example, it equals
$\varepsilon^{4/5}$ for $\beta = 2$. Thus, the second-order terms might be comparable with the
first-order ones. In absolute value, the first-order term of nonnormalized risk decreases
as $\varepsilon^2$ and the second-order term as $\varepsilon^{(8\beta-2)/(2\beta+1)}$.
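
The two expressions for the minimax value in (30) can be compared numerically. The sketch below evaluates the sum $\varepsilon^2\sum_k(2\pi k)^2 q_k$ from (29) with $W_\varepsilon$ replaced by the leading term of (27) (an approximation made here for simplicity, rather than the exact root of (26)), and sets it against $C^*(\beta,L)\,\varepsilon^{(4\beta-4)/(2\beta+1)}$. Function names are illustrative.

```python
import math

def w_asymptotic(eps, beta, big_l):
    """Leading term of (27)."""
    num = big_l * (beta + 2.0) * (2.0 * beta + 1.0)
    den = (2.0 * math.pi) ** (2.0 * beta) * (beta - 1.0) * eps ** 2
    return (num / den) ** (1.0 / (2.0 * beta + 1.0))

def minimax_value(eps, beta, w):
    """eps^2 * sum_k (2 pi k)^2 q_k with q_k = [1 - (k / w)^(beta - 1)]_+ as in (25)."""
    total = 0.0
    for k in range(1, int(math.floor(w)) + 1):
        total += (2.0 * math.pi * k) ** 2 * (1.0 - (k / w) ** (beta - 1.0))
    return eps ** 2 * total

def c_star(beta, big_l):
    """Constant C*(beta, L) of (30)."""
    a = (big_l * (2.0 * beta + 1.0)) ** (3.0 / (2.0 * beta + 1.0))
    b = ((beta - 1.0) / (2.0 * math.pi * (beta + 2.0))) ** ((2.0 * beta - 2.0) / (2.0 * beta + 1.0))
    return a * b / 3.0

def r_asymptotic(eps, beta, big_l):
    """Leading term C* eps^((4 beta - 4) / (2 beta + 1)) of the minimax value."""
    return c_star(beta, big_l) * eps ** ((4.0 * beta - 4.0) / (2.0 * beta + 1.0))
```

The two closed forms in (30) are algebraically identical once (27) is substituted, whereas the discrete sum differs from them by a relative error that vanishes as $W_\varepsilon$ grows.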


6. Locally minimax lower bound and second-order efficiency. In this section
we obtain a lower bound for the minimax risk and construct a second-order
efficient estimator of $\theta$.

Let $\bar f$ be a fixed function from $\mathcal{F}(\rho, C_0)$ with the Fourier coefficients denoted
by $\bar f_k$. For $\delta > 0$ define a vicinity of $\bar f$ by

$$
\mathcal{F}_\delta(\bar f) = \{f = \bar f + v : \|v\| \le \delta,\ v \in W(\beta, L)\}. \tag{31}
$$




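It is assumed that $\beta \ge 2$. Recall that $\|\bar f''\| < \infty$ since $\bar f \in \mathcal{F}(\rho, C_0)$ (cf. Assumption A3).
If $\delta$ is small enough, $\mathcal{F}_\delta(\bar f) \subset \mathcal{F}(\rho', C_0')$ for some $\rho' > 0$, $C_0' > 0$ depending
only on $\rho, C_0, L$.


THEOREM 2. Let the real number $\delta = \delta_\varepsilon$ be such that $\lim_{\varepsilon\to 0}\delta_\varepsilon = 0$ and
$\lim_{\varepsilon\to 0}\delta_\varepsilon^2/(\varepsilon^2 W_\varepsilon^{1+\alpha}) = \infty$ for some $\alpha > 0$, where $W_\varepsilon$ satisfies (27). Then, as
$\varepsilon \to 0$,

$$
\inf_{\hat\theta_\varepsilon}\ \sup_{\theta\in\Theta,\ f\in\mathcal{F}_{\delta_\varepsilon}(\bar f)} E_{\theta,f}\big[(\hat\theta_\varepsilon - \theta)^2 I^\varepsilon(f)\big] \ge 1 + (1 + o(1))\frac{r^\varepsilon}{\|\bar f'\|^2}. \tag{32}
$$

Here and in what follows $\inf_{\hat\theta_\varepsilon}$ (or $\inf_{\hat\theta}$) is the infimum over all estimators based
on the observation $X^\varepsilon$, and $r^\varepsilon$ is the minimax value defined in (29).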


The proof of Theorem 2 is given in Section 8. 
Motivated by the above results, we introduce the following notion of semipara- 
metric second-order efficiency. 


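DEFINITION 1. An estimator $\theta_\varepsilon^*$ is called second-order efficient at $\bar f \in \mathcal{F}$ if

$$
\sup_{\theta\in\Theta,\ f\in\mathcal{F}_{\delta_\varepsilon}(\bar f)} E_{\theta,f}\big[(\theta_\varepsilon^* - \theta)^2 I^\varepsilon(f)\big] = 1 + (1 + o(1))\frac{r^\varepsilon}{\|\bar f'\|^2} \qquad \text{as } \varepsilon \to 0, \tag{33}
$$

for some $\delta_\varepsilon > 0$ such that $\lim_{\varepsilon\to 0}\delta_\varepsilon = 0$.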


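Comparing Theorems 1 and 2 we see that if there exists a sequence of weights
$h_k = \lambda_k^*$ for which Assumptions B and C are satisfied and

$$
\sup_{f\in\mathcal{F}_{\delta_\varepsilon}(\bar f)} R^\varepsilon[f, \lambda^*] \le r^\varepsilon(1 + o(1)), \tag{34}
$$

where $\lambda^* = (\lambda_k^*)$, then the estimator $\hat\theta_{AD}$ with this choice of weights is second-order
efficient. At first sight, it seems that one can take $\lambda_k^* = q_k$ from (25). However,
for $h_k = q_k$ Assumption C is not fulfilled. Therefore we correct $q_k$, taking

$$
\lambda_k^* =
\begin{cases}
1, & k \le \gamma_\varepsilon W_\varepsilon, \\[4pt]
\left[1 - \left(\dfrac{k}{W_\varepsilon}\right)^{\beta-1}\right]_+, & k > \gamma_\varepsilon W_\varepsilon,
\end{cases}
$$

where $W_\varepsilon$ is a solution of (26) and $\gamma_\varepsilon = 1/\log(\varepsilon^{-2})$. For $k > \gamma_\varepsilon W_\varepsilon$, the weights
$\lambda_k^*$ induce a prior on $\{f_k\}$ analogous to the one that appears in the proof of the
lower bound of Theorem 2. The corresponding penalized maximum likelihood
type estimator has the form

$$
\hat\theta^*_{PML} = \arg\max_{\tau\in\Theta}\sum_{k=1}^{\infty}\lambda_k^*\left(\sqrt{2}\int_{-1/2}^{1/2}\cos[2\pi k(t-\tau)]\,x^\varepsilon(t)\,dt\right)^2. \tag{35}
$$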
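
The corrected weights are straightforward to tabulate. The sketch below builds $\lambda_k^*$ for $k = 1, \ldots, k_{\max}$ from a given $W_\varepsilon$: weight 1 up to $\gamma_\varepsilon W_\varepsilon$ with $\gamma_\varepsilon = 1/\log(\varepsilon^{-2})$, and the Pinsker-type weight $[1-(k/W_\varepsilon)^{\beta-1}]_+$ beyond. The function name and the value of $W_\varepsilon$ used in any call are illustrative; in the text $W_\varepsilon$ is the solution of (26).

```python
import math

def corrected_weights(w_eps, eps, beta, k_max):
    """Weights lambda_k^* for k = 1..k_max: flat at 1 below gamma*W, Pinsker-type above."""
    gamma = 1.0 / math.log(eps ** -2)
    cutoff = gamma * w_eps
    weights = []
    for k in range(1, k_max + 1):
        if k <= cutoff:
            weights.append(1.0)
        else:
            weights.append(max(0.0, 1.0 - (k / w_eps) ** (beta - 1.0)))
    return weights
```

Only finitely many weights are nonzero (those with $k < W_\varepsilon$), so the estimator (35) involves a finite sum.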




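THEOREM 3. Let a function $\bar f \in \mathcal{F}$ be such that, for some $p > \beta > 1$,

$$
\sum_{k=1}^{\infty}(2\pi k)^{2p}\bar f_k^2 < \infty \tag{36}
$$

and $\lim_{\varepsilon\to 0}\delta_\varepsilon = 0$, $\lim_{\varepsilon\to 0}\delta_\varepsilon^2/(\varepsilon^2 W_\varepsilon^{1+\alpha}) = \infty$ for some $\alpha > 0$, where $W_\varepsilon$ satisfies (27).
Then, as $\varepsilon \to 0$, the local asymptotic minimax risk admits the second-order
expansion

$$
\inf_{\hat\theta_\varepsilon}\ \sup_{\theta\in\Theta,\ f\in\mathcal{F}_{\delta_\varepsilon}(\bar f)} E_{\theta,f}\big[(\hat\theta_\varepsilon - \theta)^2 I^\varepsilon(f)\big] = 1 + (1 + o(1))\frac{r^\varepsilon}{\|\bar f'\|^2}. \tag{37}
$$

Moreover, the estimator $\hat\theta^*_{PML}$ defined in (35) is second-order efficient at $\bar f$.

The proof of Theorem 3 is given in Section 9.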


REMARK 1. Theorems 2 and 3 are local in $f$ and nonlocal in $\theta$. Inspection
of the proofs shows that they can be turned into local ones in $\theta$ as well, that is,
that one can replace $\sup_{\theta\in\Theta}$ by $\sup_{|\theta-\theta_0|<\tau}$ where $\tau > 0$ is a small number (fixed
or tending to 0 with $\varepsilon$ not too fast) and $\theta_0$ is an interior point of $\Theta$.


REMARK 2. In the argument of Section 3, $\lambda_k = \sigma_k^2/(\sigma_k^2 + \varepsilon^2)$. The values
$(\sigma_k^*)^2$ corresponding to $\lambda_k^*$ for $k > \gamma_\varepsilon W_\varepsilon$ are thus

$$
(\sigma_k^*)^2 = \varepsilon^2\frac{\lambda_k^*}{1 - \lambda_k^*} = \varepsilon^2\left[\left(\frac{W_\varepsilon}{k}\right)^{\beta-1} - 1\right]_+.
$$

One can interpret these $(\sigma_k^*)^2$ as variances of the prior distributions of the $f_k$'s
introduced in Section 3. These variances appear also in the proof of the lower
bound [cf. (46)]. The fact that the initial values of $\lambda_k^*$ are equal to 1 means that
we do not put any prior distribution on the Fourier coefficients $f_k$ for $k \le \gamma_\varepsilon W_\varepsilon$.
Note that this is a particular choice of a prior associated with the Sobolev classes
of functions.





REMARK 3. It is interesting to compare results on nonparametric and semiparametric
second-order efficiency. Golubev and Levit [11, 12] and Dalalyan and
Kutoyants [6] considered nonparametric problems where there exist $\sqrt{n}$-consistent
first-order efficient estimators (such as estimation of the cumulative distribution
function). In these problems there are simple efficient estimators, such as the empirical
c.d.f., and smoothed estimators allow one to improve upon these simple estimators,
so that the second-order asymptotic terms are always negative. On the contrary,
in semiparametric problems, as in the one considered here, simple empirical estimators
are not efficient, and one has to use smoothing already to attain first-order
efficiency. As we see from Theorems 1–3 (cf. also Golubev and Härdle [9, 10],
who studied partial linear models), in semiparametric problems second-order asymptotic
terms are positive, so that they always spoil asymptotics. This suggests
that the choice of correct smoothing that allows one to optimize second-order asymptotic
terms is more important in semiparametrics than in nonparametrics.


7. Proof of Theorem 1. In what follows we use the same notation $C$ for finite
positive constants that may be different on different occasions and can depend only
on $\tau_0$, $\rho$, $C_0$, $\rho_1$ and $C_1$.

The first step of the proof of Theorem 1 is to show that the estimator $\hat\theta_{AD}$ is
$\varepsilon$-consistent.


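7.1. Consistency of $\hat\theta_{AD}$. The estimator $\hat\theta_{AD}$ is a maximizer of the contrast
function

$$
\begin{aligned}
L(\tau) &= \sum_{k=1}^{\infty}h_k\left(\sqrt{2}\int_{-1/2}^{1/2}\cos[2\pi k(t-\tau)]\,x^\varepsilon(t)\,dt\right)^2 \\
&= \sum_{k=1}^{\infty}h_k\Big(f_k\cos[2\pi k(\tau-\theta)] + \varepsilon\xi_k(0)\cos(2\pi k\tau) + \varepsilon\xi_k^*(0)\sin(2\pi k\tau)\Big)^2 \\
&= \sum_{k=1}^{\infty}h_k f_k^2\cos^2[2\pi k(\tau-\theta)] + 2\varepsilon\|f'\|\,\eta_1(\tau) + \varepsilon^2\eta_2(\tau),
\end{aligned}
$$

where $\theta$ is the true value of the parameter,

$$
\xi_k(u) = \sqrt{2}\int_{-1/2}^{1/2}\cos[2\pi k(t-u)]\,n(t)\,dt, \qquad
\xi_k^*(u) = \sqrt{2}\int_{-1/2}^{1/2}\sin[2\pi k(t-u)]\,n(t)\,dt
$$

and

$$
\eta_1(\tau) = \|f'\|^{-1}\sum_{k=1}^{\infty}h_k f_k\cos[2\pi k(\tau-\theta)]\big(\xi_k(0)\cos(2\pi k\tau) + \xi_k^*(0)\sin(2\pi k\tau)\big),
$$

$$
\eta_2(\tau) = \sum_{k=1}^{\infty}h_k\big(\xi_k(0)\cos(2\pi k\tau) + \xi_k^*(0)\sin(2\pi k\tau)\big)^2.
$$

The following three lemmas allow us to control the first derivatives of $\eta_1(\tau)$ and
$\eta_2(\tau)$.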


LEMMA 1. Uniformly in $f \in \mathcal{F}$ we have

$$
P\Big\{\sup_{\tau\in\Theta}|\eta_1'(\tau)| > x\Big\} \le c_1\exp(-c_2 x^2) \qquad \forall x > 0,
$$

where the constants $c_1 > 0$ and $c_2 > 0$ depend only on $\rho$ and $C_0$.




PROOF. Note that $\eta_1'(\tau)$ is a stationary Gaussian random process with mean 0
and twice continuously differentiable correlation function $r(\cdot)$ such that $r''(0) \ne 0$.
It follows from the Rice formula ([18], Theorem 7.3.2, or [5], Chapter 13.5, page
294; see also Proposition 2 in [28]) that for all $x > 0$,

$$
P\Big\{\sup_{\tau\in\Theta}|\eta_1'(\tau)| > x\Big\} \le C\Big(\sqrt{-r''(0)/r(0)} + 1\Big)\exp\left(-\frac{x^2}{2r(0)}\right), \tag{38}
$$

where $C > 0$ is a universal constant. Now, since $f \in \mathcal{F}$,

$$
r(0) = E[\eta_1'(\theta)^2] = \|f'\|^{-2}\sum_{k=1}^{\infty}h_k^2 f_k^2(2\pi k)^2 \ge (2\pi)^2\|f'\|^{-2}\rho,
$$

$$
-r''(0) = E[\eta_1''(\theta)^2] = 4\|f'\|^{-2}\sum_{k=1}^{\infty}h_k^2 f_k^2(2\pi k)^4 \le 4C_0\|f'\|^{-2},
$$

which together with (38) proves the lemma. $\square$


We will use the following simple fact about moderate deviations of the random
variable

$$
\zeta = \sum_{k=1}^{\infty}a_k(\xi_k^2 - 1),
$$

where the $\xi_k$'s are i.i.d. standard normal random variables and $\{a_k\}$ is a sequence
belonging to $\ell_2$, so that the random series converges almost surely.


LEMMA 2. Let $a_k \ne 0$, $\{a_k\} \in \ell_2$. For any $0 < x \le \|a\|/\max_k|a_k|$ we have

$$
P\Big(|\zeta| > x\sqrt{E[\zeta^2]}\Big) \le 2\exp(-x^2/16).
$$

This result follows, for example, from (27) of Lemma 2 in [4].


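LEMMA 3. For any $0 < x \le \|h'\|/\max_k h_k(2\pi k)$,

$$
P\Big\{\sup_{\tau\in\Theta}|\eta_2'(\tau)| > 4\sum_{k=1}^{\infty}h_k(2\pi k) + x\|h'\|\Big\} \le 4\exp(-c_3 x^2),
$$

where $c_3 > 0$ is a universal constant.


PROOF. Using the Cauchy–Schwarz inequality we get

$$
\begin{aligned}
\sup_{\tau}|\eta_2'(\tau)| &\le 2\sum_{k=1}^{\infty}h_k(2\pi k)\sup_{\tau}\Big\{\big|\xi_k(0)\cos(2\pi k\tau) + \xi_k^*(0)\sin(2\pi k\tau)\big| \\
&\qquad\qquad \times \big|-\xi_k(0)\sin(2\pi k\tau) + \xi_k^*(0)\cos(2\pi k\tau)\big|\Big\} \\
&\le 2\sum_{k=1}^{\infty}h_k(2\pi k)\big[\xi_k^2(0) + \xi_k^{*2}(0)\big].
\end{aligned}
$$

The rest follows from Lemma 2. $\square$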


Consider now the expectation of the contrast function $L(\cdot)$,

$$
E[L(\tau)] = \sum_{k=1}^{\infty}h_k f_k^2\cos^2[2\pi k(\tau-\theta)].
$$

LEMMA 4. Let Assumptions A1–A3 and B be satisfied. Then

$$
E[L(\tau)] - E[L(\theta)] \le -C|\tau - \theta|^2 \qquad \forall \tau \in \Theta,
$$

where the constant $C > 0$ depends only on $\tau_0$, $\rho$ and $C_0$.

PROOF. The derivatives of the function $G(\tau) = E[L(\tau)]$ satisfy $G'(\theta) = 0$,
$G''(\theta) = -2\sum_{k=1}^{\infty}h_k f_k^2(2\pi k)^2 \le -2(2\pi)^2\rho$. Thus, the assertion of the lemma
holds for $\tau$ in some neighborhood of $\theta$. Since also $E[L(\tau)] < E[L(\theta)]$ for all
$\tau \in \Theta$, $\tau \ne \theta$, and $\Theta$ is a bounded interval (cf. Assumption A1), the lemma follows. $\square$


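Now we are ready to show that $\hat\theta_{AD}$ is $\varepsilon$-consistent.


LEMMA 5. Let Assumptions A1–A3 and B be satisfied. Then, uniformly in
$f \in \mathcal{F}$ and in $\theta \in \Theta$,

$$
P_{\theta,f}\Big\{|\hat\theta_{AD} - \theta|\sqrt{I^\varepsilon(f)} > x\Big\} \le c_4\exp(-c_5 x^2)
$$

for all $x \in [x_0, \|h'\|/\max_k h_k(2\pi k)]$, where $c_4 > 0$, $c_5 > 0$, $x_0 > 0$ are constants
depending only on $\tau_0$, $\rho$, $C_0$, $C_1$.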


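PROOF. Due to Lemma 4 we have

$$
\begin{aligned}
P_{\theta,f}\Big\{|\hat\theta_{AD} - \theta|\sqrt{I^\varepsilon(f)} > x\Big\}
&\le P\Big\{\max_{\tau\in\Theta:\,|\tau-\theta|>x/\sqrt{I^\varepsilon(f)}}[L(\tau) - L(\theta)] \ge 0\Big\} \\
&\le P\Big\{\max_{\tau\in\Theta:\,|\tau-\theta|>x/\sqrt{I^\varepsilon(f)}}\big[E[L(\tau)] - E[L(\theta)] + 2\varepsilon\|f'\|(\eta_1(\tau) - \eta_1(\theta)) \\
&\qquad\qquad + \varepsilon^2(\eta_2(\tau) - \eta_2(\theta))\big] \ge 0\Big\} \\
&\le P\Big\{\max_{\tau\in\Theta:\,|\tau-\theta|>x/\sqrt{I^\varepsilon(f)}}\Big[E[L(\tau)] - E[L(\theta)] + |\tau-\theta|\Big(2\varepsilon\|f'\|\max_{u\in\Theta}|\eta_1'(u)| \\
&\qquad\qquad + \varepsilon^2\max_{u\in\Theta}|\eta_2'(u)|\Big)\Big] \ge 0\Big\} \\
&\le P\Big\{\max_{u\in\Theta}|\eta_1'(u)| + \varepsilon\max_{u\in\Theta}|\eta_2'(u)| \ge Cx\Big\} \\
&\le P\Big\{\max_{u\in\Theta}|\eta_1'(u)| \ge Cx\Big\} + P\Big\{\varepsilon\max_{u\in\Theta}|\eta_2'(u)| \ge Cx\Big\}.
\end{aligned}
$$

The first probability on the last line is controlled by Lemma 1, whereas the second
probability can be bounded, in view of Lemma 3, by $4\exp(-Cx^2)$, since according
to Assumption B2 one has $\varepsilon\sum_{k=1}^{\infty}h_k(2\pi k) \le C'$, $\varepsilon\|h'\| \le C'$, where $C'$ depends
only on $C_1$, and thus $Cx \ge 4\varepsilon\sum_{k=1}^{\infty}h_k(2\pi k) + c\varepsilon x\|h'\|$ for any $x \ge x_0$ if $x_0$ is
large enough and $c > 0$ is small enough. $\square$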

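7.2. Proof of Theorem 1. Let us introduce the event $A_1 = \{|\hat\theta_{AD} - \theta| \le c_6\,\varepsilon\sqrt{\log(\varepsilon^{-2})}\}$,
where $c_6 > 0$ is a sufficiently large constant that can depend only
on $\tau_0$, $\rho$, $C_0$ and $C_1$. The risk of $\hat\theta_{AD}$ can be decomposed into two terms,

$$
E_{\theta,f}\big[(\hat\theta_{AD} - \theta)^2\big] = E_{\theta,f}\big[(\hat\theta_{AD} - \theta)^2 1_{A_1}\big] + E_{\theta,f}\big[(\hat\theta_{AD} - \theta)^2 1_{A_1^c}\big]. \tag{39}
$$

Using (24) and Lemma 5 we find that, for $c_6$ large enough,

$$
E_{\theta,f}\big[(\hat\theta_{AD} - \theta)^2 I^\varepsilon(f)1_{A_1^c}\big] \le C\varepsilon^{-2}P_{\theta,f}(A_1^c) = O(\varepsilon^2) \qquad \text{as } \varepsilon \to 0. \tag{40}
$$

Indeed, for $x = \varepsilon\sqrt{I^\varepsilon(f)\log(\varepsilon^{-2})} \ge C\sqrt{\log(\varepsilon^{-2})}$, due to Assumption B1 and (24),
one has

$$
\frac{x\max_k h_k(2\pi k)}{\|h'\|} \le \frac{\sqrt{C_0}\,\log^{-3/2}(\varepsilon^{-2})}{\rho_1} \to 0 \qquad \text{as } \varepsilon \to 0.
$$

Thus we can apply Lemma 5, which yields (40) when $c_6$ is large enough. It remains
to find the asymptotics of the first term on the right-hand side of (39). The estimator
$\hat\theta_{AD}$ satisfies

$$
L'(\hat\theta_{AD}) = 0. \tag{41}
$$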


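Using Taylor approximation of the left-hand side of (41) in a neighborhood of $\theta$
we may write, for some $\omega \in \Theta$,

$$
L_0(\theta) + (\theta - \hat\theta_{AD})L_1(\theta) + \tfrac{1}{2}(\theta - \hat\theta_{AD})^2 L_2(\omega) = 0, \tag{42}
$$

where

$$
L_0(\theta) = \varepsilon\sum_{k=1}^{\infty}h_k(2\pi k)f_k\,\xi_k^*(\theta) + \varepsilon^2\sum_{k=1}^{\infty}h_k(2\pi k)\,\xi_k(\theta)\xi_k^*(\theta),
$$

$$
L_1(\theta) = \sum_{k=1}^{\infty}h_k(2\pi k)^2\Big(f_k^2 + 2\varepsilon f_k\xi_k(\theta) + \varepsilon^2\big[\xi_k^2(\theta) - \xi_k^{*2}(\theta)\big]\Big),
$$

$$
L_2(\omega) = -4\sum_{k=1}^{\infty}h_k(2\pi k)^3\left(\sqrt{2}\int_{-1/2}^{1/2}\cos[2\pi k(t-\omega)]\,x^\varepsilon(t)\,dt\right)\left(\sqrt{2}\int_{-1/2}^{1/2}\sin[2\pi k(t-\omega)]\,x^\varepsilon(t)\,dt\right).
$$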


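LEMMA 6. Let Assumptions A1–A3 and B be satisfied. Then

$$
\sup_{\theta\in\Theta,\ f\in\mathcal{F}} E_{\theta,f}\Big[\big(L_1(\theta) - E_{\theta,f}[L_1(\theta)]\big)^2\Big] = O(\varepsilon^2) \qquad \text{as } \varepsilon \to 0,
$$

and

$$
\sup_{\theta\in\Theta,\ f\in\mathcal{F}} E_{\theta,f}\Big[\sup_{\omega\in\Theta}|L_2(\omega)|^2\Big] \le C.
$$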


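PROOF. We omit the proof of the first relation since it follows from simple
algebra. To prove the second one, using trigonometric formulae and the
Cauchy–Schwarz inequality, we write

$$
\begin{aligned}
\sqrt{2}\left|\int_{-1/2}^{1/2}\cos[2\pi k(t-\omega)]\,x^\varepsilon(t)\,dt\right|
&= \big|f_k\cos[2\pi k(\theta-\omega)] + \varepsilon\xi_k(0)\cos[2\pi k\omega] + \varepsilon\xi_k^*(0)\sin[2\pi k\omega]\big| \\
&\le |f_k| + \varepsilon\sqrt{\xi_k(0)^2 + \xi_k^*(0)^2}.
\end{aligned}
$$

Similarly, $\sqrt{2}\,\big|\int_{-1/2}^{1/2}\sin[2\pi k(t-\omega)]\,x^\varepsilon(t)\,dt\big| \le |f_k| + \varepsilon\sqrt{\xi_k(0)^2 + \xi_k^*(0)^2}$. Therefore

$$
|L_2(\omega)| \le C\sum_{k=1}^{\infty}h_k k^3 f_k^2 + C\varepsilon^2\sum_{k=1}^{\infty}h_k(2\pi k)^3\big[\xi_k^2(0) + \xi_k^{*2}(0)\big].
$$

The second inequality of the lemma follows easily from this and Assumptions
A3 and B2. $\square$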


To analyze the behavior of $\hat\theta_{AD}$ we compare it to the root $\tilde\theta$ of the linear equation

$$
L_0(\theta) + (\theta - \tilde\theta)E_{\theta,f}[L_1(\theta)] = 0 \tag{43}
$$

representing an approximation of (42).


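LEMMA 7. Let Assumptions A1–A3, B and C be satisfied. Then

$$
E_{\theta,f}\big[(\tilde\theta - \theta)^2 I^\varepsilon(f)\big] = 1 + (1 + o(1))\frac{R^\varepsilon[f,h]}{\|f'\|^2},
$$

where $o(1) \to 0$ uniformly in $f \in \mathcal{F}$ and in $\theta \in \Theta$, as $\varepsilon \to 0$.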


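PROOF. Using the inequality $(1 - h_k^2)f_k^2(2\pi k)^2 \le 2(1 - h_k)f_k^2(2\pi k)^2$,
Assumption C and (24), we get from (43),

$$
\begin{aligned}
E_{\theta,f}\big[(\tilde\theta - \theta)^2 I^\varepsilon(f)\big]
&= \frac{\|f'\|^2\sum_{k=1}^{\infty}(h_k^2 f_k^2 + \varepsilon^2 h_k^2)(2\pi k)^2}{\big(\|f'\|^2 + \sum_{k=1}^{\infty}(h_k - 1)f_k^2(2\pi k)^2\big)^2} \\
&= \Big(1 + \|f'\|^{-2}\sum_{k=1}^{\infty}\big[(h_k^2 - 1)f_k^2 + \varepsilon^2 h_k^2\big](2\pi k)^2\Big) \\
&\qquad \times \Big(1 - 2\|f'\|^{-2}\sum_{k=1}^{\infty}(h_k - 1)f_k^2(2\pi k)^2 + o(R^\varepsilon[f,h])\Big) \\
&= 1 + (1 + o(1))\|f'\|^{-2}R^\varepsilon[f,h]. \qquad \square
\end{aligned}
$$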


LEMMA 8. Let Assumptions A1–A3 and B be satisfied. Then
$E_{\theta,f}\big[(\hat\theta_{AD} - \tilde\theta)^2 1_{A_1}\big] \le C\varepsilon^4\log^2(\varepsilon^{-2})$.


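PROOF. Since no confusion is possible, we omit the subscripts $\theta, f$ of the
expectation. Subtracting (43) from (42) we obtain

$$
(\hat\theta_{AD} - \tilde\theta)E[L_1(\theta)] - (\theta - \hat\theta_{AD})\big(L_1(\theta) - E[L_1(\theta)]\big) - \tfrac{1}{2}(\theta - \hat\theta_{AD})^2 L_2(\omega) = 0.
$$

Note that $E[L_1(\theta)] = \sum_{k=1}^{\infty}h_k(2\pi k)^2 f_k^2 \ge (2\pi)^2\rho$ and that $(\hat\theta_{AD} - \theta)^2 \le c_6^2\,\varepsilon^2\log(\varepsilon^{-2})$
on $A_1$. Using these facts and Lemma 6 we get

$$
\begin{aligned}
E\big[(\hat\theta_{AD} - \tilde\theta)^2 1_{A_1}\big]
&\le 2\big(E[L_1(\theta)]\big)^{-2}\Big\{E\big[(\theta - \hat\theta_{AD})^2\big(L_1(\theta) - E[L_1(\theta)]\big)^2 1_{A_1}\big] \\
&\qquad\qquad + E\Big[(\theta - \hat\theta_{AD})^4\sup_{\omega\in\Theta}|L_2(\omega)|^2 1_{A_1}\Big]\Big\} \\
&\le C\varepsilon^4\log^2(\varepsilon^{-2}). \qquad \square
\end{aligned}
$$

Now Assumption B1 and the fact that $h_1 = 1$ yield, for $\varepsilon$ small enough,

$$
R^\varepsilon[f,h] \ge \varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2 h_k^2 \ge \rho_1^2\,\varepsilon^2\Big(\max_k h_k(2\pi k)\Big)^2\log^4(\varepsilon^{-2}) \ge (2\pi\rho_1)^2\varepsilon^2\log^4(\varepsilon^{-2}),
$$

which implies that $E_{\theta,f}\big[(\hat\theta_{AD} - \tilde\theta)^2 I^\varepsilon(f)1_{A_1}\big] = o(R^\varepsilon[f,h])$ uniformly in $f \in \mathcal{F}$
and in $\theta \in \Theta$, as $\varepsilon \to 0$. This result together with (39), (40) and Lemma 7 completes
the proof of Theorem 1.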




8. Proof of Theorem 2. Before proceeding to the proof of Theorem 2 we give 
some preliminary results. 


8.1. An auxiliary Bayesian problem. We consider a model with two observations
that will be used as a building block for the subsequent proofs. Set

$$
x = f_0\cos(2\pi k\theta) + \varepsilon\xi, \qquad x^* = f_0\sin(2\pi k\theta) + \varepsilon\xi^*,
$$

where $\xi, \xi^*$ are independent $\mathcal{N}(0,1)$ random variables and $f_0$ is an $\mathcal{N}(\bar f, \sigma^2)$
random variable that does not depend on $(\xi, \xi^*)$, with $\bar f \in \mathbb{R}$, $\sigma^2 > 0$. Here $\theta$ is
a parameter to be estimated based on the observations $x, x^*$ and $k$ is an integer.
Define the Fisher information

$$
J^\varepsilon(\theta) = E\left[\left(\frac{d}{d\theta}\log p_\theta(x, x^*)\right)^2\right],
$$

where $p_\theta(x, x^*)$ is the probability density of the observations.

LEMMA 9. We have $J^\varepsilon(\theta) = \varepsilon^{-2}(\bar f^2 + \lambda\sigma^2)(2\pi k)^2$, for any $k \in \mathbb{Z}$.
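PROOF. Denoting by $C$ multiplicative constants that do not depend on $\theta$, we
have

$$
\begin{aligned}
p_\theta(x, x^*) &= C\int\exp\Big[-\frac{[x - \bar f\cos(2\pi k\theta) - u\cos(2\pi k\theta)]^2}{2\varepsilon^2}
- \frac{[x^* - \bar f\sin(2\pi k\theta) - u\sin(2\pi k\theta)]^2}{2\varepsilon^2} - \frac{u^2}{2\sigma^2}\Big]\,du \\
&= C\exp\Big\{\frac{\lambda}{2\varepsilon^2}[x\cos(2\pi k\theta) + x^*\sin(2\pi k\theta)]^2
+ \frac{(1-\lambda)\bar f}{\varepsilon^2}[x\cos(2\pi k\theta) + x^*\sin(2\pi k\theta)]\Big\} \\
&= C\exp\Big\{\frac{\lambda}{2\varepsilon^2}\Big[x\cos(2\pi k\theta) + x^*\sin(2\pi k\theta) + \frac{(1-\lambda)\bar f}{\lambda}\Big]^2\Big\},
\end{aligned}
$$

where $\lambda = \sigma^2/(\varepsilon^2 + \sigma^2)$. Hence writing $f_0 = \bar f + \eta\sigma$ where $\eta \sim \mathcal{N}(0,1)$ and $\eta$ is
independent of $(\xi, \xi^*)$, one obtains

$$
\begin{aligned}
J^\varepsilon(\theta) &= E\Big[\Big(\frac{\lambda}{2\varepsilon^2}\frac{d}{d\theta}\Big[x\cos(2\pi k\theta) + x^*\sin(2\pi k\theta) + \frac{(1-\lambda)\bar f}{\lambda}\Big]^2\Big)^2\Big] \\
&= \frac{(2\pi k)^2\lambda^2}{\varepsilon^4}E\Big[\Big(\bar f + \eta\sigma + \varepsilon\xi\cos(2\pi k\theta) + \varepsilon\xi^*\sin(2\pi k\theta) + \frac{(1-\lambda)\bar f}{\lambda}\Big)^2 \\
&\qquad\qquad \times \big(-\varepsilon\xi\sin(2\pi k\theta) + \varepsilon\xi^*\cos(2\pi k\theta)\big)^2\Big] \\
&= \frac{(2\pi k)^2\lambda^2}{\varepsilon^2}E\Big[\Big((\lambda^{-1}\bar f + \eta\sigma)\big(-\xi\sin(2\pi k\theta) + \xi^*\cos(2\pi k\theta)\big) \\
&\qquad\qquad + \frac{\varepsilon}{2}(\xi^{*2} - \xi^2)\sin(4\pi k\theta) + \varepsilon\xi\xi^*\cos(4\pi k\theta)\Big)^2\Big] \\
&= (2\pi k)^2\varepsilon^{-2}\lambda^2\big[\lambda^{-2}\bar f^2 + \sigma^2 + \varepsilon^2\big].
\end{aligned}
$$

Since $\lambda(\sigma^2 + \varepsilon^2) = \sigma^2$, the right-hand side equals $\varepsilon^{-2}(\bar f^2 + \lambda\sigma^2)(2\pi k)^2$. $\square$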
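
The closed form of Lemma 9 can be checked by simulation. The following sketch, a Monte Carlo check under the two-observation model of Section 8.1, averages the squared score $\frac{d}{d\theta}\log p_\theta(x, x^*)$ obtained from the last expression for $p_\theta$ in the proof. The sample size, seed, parameter values and function names are arbitrary illustrative choices.

```python
import math
import random

def fisher_information(f_bar, sigma2, eps, k):
    """Closed form of Lemma 9: eps^-2 (f_bar^2 + lambda sigma^2) (2 pi k)^2."""
    lam = sigma2 / (eps ** 2 + sigma2)
    return (f_bar ** 2 + lam * sigma2) * (2.0 * math.pi * k) ** 2 / eps ** 2

def score(x, xs, theta, f_bar, sigma2, eps, k):
    """Derivative in theta of log p_theta(x, x^*) for the marginal density of the proof."""
    lam = sigma2 / (eps ** 2 + sigma2)
    w = 2.0 * math.pi * k
    y = x * math.cos(w * theta) + xs * math.sin(w * theta)
    dy = w * (-x * math.sin(w * theta) + xs * math.cos(w * theta))
    return lam / eps ** 2 * (y + (1.0 - lam) * f_bar / lam) * dy

def monte_carlo_information(theta, f_bar, sigma2, eps, k, n, seed=0):
    """Average of the squared score over n simulated pairs (x, x^*)."""
    rng = random.Random(seed)
    w = 2.0 * math.pi * k
    acc = 0.0
    for _ in range(n):
        f0 = f_bar + math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        x = f0 * math.cos(w * theta) + eps * rng.gauss(0.0, 1.0)
        xs = f0 * math.sin(w * theta) + eps * rng.gauss(0.0, 1.0)
        acc += score(x, xs, theta, f_bar, sigma2, eps, k) ** 2
    return acc / n
```

The simulated average does not depend on $\theta$ beyond sampling error, in line with the statement of the lemma.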


8.2. Lower bounds for Bayes risks. In this subsection we consider the sequence
model (17) where we suppose that the $f_k$'s are no longer fixed values but
independent random variables distributed as $\mathcal{N}(\bar f_k, \sigma_k^2)$ with some $\sigma_k \ge 0$. By convention,
$\sigma_k = 0$ means that the corresponding $f_k$ is equal to $\bar f_k$ almost surely. We
assume in what follows that $\sigma_k > 0$ only for a finite (and possibly depending on $\varepsilon$)
number of indices $k$. We also assume that the random sequence $(f_k, k = 1, 2, \ldots)$
does not depend on the noises $(\xi_k, \xi_k^*, k = 1, 2, \ldots)$. We will refer to this model
as the Bayes model with fixed $\theta$. Let $\Psi_\varepsilon(df)$ denote the probability distribution of
$f = \{f_k\} \in \ell_2$ in this model.

Along with this, we will consider the full Bayes model defined in the same
way, except that in this new model $\theta$ is supposed to be a random variable having
a density $\pi(x)$, $x \in \Theta$, that vanishes at the endpoints of the interval $\Theta$ and has
finite Fisher information $I_\pi = \int(\pi'(x))^2\pi^{-1}(x)\,dx$. It will be assumed that $\theta$ is
independent of $(f_k, \xi_k, \xi_k^*, k = 1, 2, \ldots)$.

We denote by $E$ the expectation with respect to the joint distribution of
$(x_k, x_k^*, k = 1, 2, \ldots)$ and $\theta$ in the full Bayes model and by $E_\theta$ the expectation
w.r.t. the distribution of $(x_k, x_k^*, k = 1, 2, \ldots)$ in the Bayes model with fixed $\theta$.
Define

$$
\lambda_k = \frac{\sigma_k^2}{\varepsilon^2 + \sigma_k^2}, \qquad k = 1, 2, \ldots.
$$


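LEMMA 10. Assume that the density $\pi(x)$ vanishes at the endpoints of the
interval $\Theta$ and has finite Fisher information $I_\pi$. Then

$$
\inf_{\hat\theta_\varepsilon} E\big[(\hat\theta_\varepsilon - \theta)^2\bar I^\varepsilon\big] \ge 1 + \frac{1}{\bar I^\varepsilon}\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k + O(\varepsilon^2), \tag{44}
$$

where

$$
\bar I^\varepsilon = \int I^\varepsilon(f)\,\Psi_\varepsilon(df) = \varepsilon^{-2}\sum_{k=1}^{\infty}(2\pi k)^2(\bar f_k^2 + \sigma_k^2).
$$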


The proofs of this and subsequent lemmas are given in the Appendix. 

In the next subsection we will show that one can choose the sequence $\{\sigma_k\}$ so
that the right-hand side of (44) coincides asymptotically with the lower bound
of Theorem 2. However, the left-hand side of (44) is different from that of (32).
One difference is that in (32) the risk is normalized by the Fisher information
$I^\varepsilon(f)$, while in (44) we have its average $\bar I^\varepsilon$ w.r.t. the distribution $\Psi_\varepsilon(df)$. The
next lemma shows that $I^\varepsilon(f)$ is sufficiently close to $\bar I^\varepsilon$; in particular, its variance
is small enough.


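LEMMA 11. If the $\sigma_k$'s are such that

$$
\sum_{k=1}^{\infty}(2\pi k)^4\sigma_k^4 + \sup_k\sigma_k^2 = o\Big(\varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k\Big), \tag{45}
$$

then

$$
\int\big(I^\varepsilon(f) - \bar I^\varepsilon\big)^2\,\Psi_\varepsilon(df) = o\Big(\varepsilon^{-2}\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k\Big) \qquad \text{as } \varepsilon \to 0.
$$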


LEMMA 12. Assume that the density $\pi(x)$ vanishes at the endpoints of the
interval $\Theta$ and has finite Fisher information $I_\pi = \int(\pi'(x))^2\pi^{-1}(x)\,dx$. Then, for
any $f \in \mathcal{F}$,

$$
\inf_{\hat\theta_\varepsilon}\int_\Theta E_{\theta,f}\big[(\hat\theta_\varepsilon - \theta)^2 I^\varepsilon(f)\big]\,\pi(\theta)\,d\theta \ge \frac{I^\varepsilon(f)}{I^\varepsilon(f) + I_\pi}.
$$

The proof of this lemma is omitted: this is the standard Van Trees inequality for the
problem of estimation of $\theta$ with fixed $f$ in model (1) ([33]; see also [7]).


LEMMA 13. If the sequence (o) satisfies relation (45) and 8 < ./p/2, then 


je : 
J. " (1- m EO) V, (df) < (e Yank 3 +CP(f ¢ FsCf)). 


8.3. From Bayes to minimax bounds. The main idea of the proof of Theorem 2
is to bound from below the minimax risk by a suitably chosen Bayes risk. In the
rest of this section we consider the full Bayes model defined in Section 8.2 with a
special choice of the $\sigma_k$'s. Namely, we set

$$
\sigma_k^2 =
\begin{cases}
0, & k \le \gamma_\varepsilon W_\varepsilon, \\
(1 - \gamma_\varepsilon)s_k^2, & k > \gamma_\varepsilon W_\varepsilon,
\end{cases} \tag{46}
$$

where $W_\varepsilon$ is a solution of (26), $s_k^2$ is defined by (28) and $\gamma_\varepsilon = 1/\log(\varepsilon^{-2})$ (here
and later we suppose that $\varepsilon$ is small enough, so that $\gamma_\varepsilon < 1$). To derive the minimax
lower bound of Theorem 2 from the Bayes bounds of Section 8.2 we need first to
show that with a probability close to 1 the Gaussian random sequence $\{f_k\}$ belongs
to the set $\mathcal{F}_\delta(\bar f)$. In fact, the following result holds.


LEMMA 14. Forany $^ > e2 Weye ^P we have P{ f ¢ Fs(f)} < e CY We, 




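8.4. Proof of Theorem 2. Recall that we consider the full Bayes model with
the $\sigma_k$'s chosen according to (46) and $\lambda_k = \sigma_k^2/(\varepsilon^2 + \sigma_k^2)$. Note that in this case

$$
\varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k = r^\varepsilon(1 + o(1)) \qquad \text{as } \varepsilon \to 0. \tag{47}
$$

Indeed, (25) and (28) imply that $|\lambda_k/q_k - 1| \le \gamma_\varepsilon$ for $k > \gamma_\varepsilon W_\varepsilon$, and hence
[cf. (29)]

$$
\varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k = (1 + o(1))\,\varepsilon^2\sum_{k>\gamma_\varepsilon W_\varepsilon}(2\pi k)^2 q_k
= (1 + o(1))\Big(r^\varepsilon - \varepsilon^2\sum_{k\le\gamma_\varepsilon W_\varepsilon}(2\pi k)^2 q_k\Big).
$$

Here [cf. (30)]

$$
\varepsilon^2\sum_{k\le\gamma_\varepsilon W_\varepsilon}(2\pi k)^2 q_k \le C\varepsilon^2 W_\varepsilon^3\int_0^{\gamma_\varepsilon}(x^2 + x^{\beta+1})\,dx \le C\gamma_\varepsilon^3\,\varepsilon^2 W_\varepsilon^3 = o(r^\varepsilon),
$$

and thus (47) follows.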

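Next, we check that if the $\sigma_k$'s are chosen according to (46), then condition
(45) is satisfied, so that one can apply Lemmas 10–13. In fact, (27) yields
$W_\varepsilon \asymp \varepsilon^{-2/(2\beta+1)}$ with $\beta > 1$, and using (47), (28) and (30) we get, as $\varepsilon \to 0$,

$$
\varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k \asymp \varepsilon^2 W_\varepsilon^3 \to 0,
$$

$$
\sum_{k=1}^{\infty}(2\pi k)^4\sigma_k^4 \le \varepsilon^4\sum_{\gamma_\varepsilon W_\varepsilon\le k\le W_\varepsilon}(2\pi k)^4(W_\varepsilon/k)^{2\beta-2} \le C\varepsilon^4 W_\varepsilon^5\gamma_\varepsilon^{-2\beta}
= o\Big(\varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k\Big),
$$

$$
\sup_k\sigma_k^2 \le \varepsilon^2\gamma_\varepsilon^{1-\beta} = o\Big(\varepsilon^2\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k\Big).
$$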


Now we start the main body of the proof of Theorem 2. First note that, in a
standard way, conditioning on $(x_k, x_k^*, k = 1, 2, \ldots)$ and using Jensen's inequality,
one can easily show that it is sufficient to prove the lower bound of Theorem 2 for
estimators $\hat\theta_\varepsilon$ depending on $X^\varepsilon$ only via $(x_k, x_k^*, k = 1, 2, \ldots)$. Let $\mathcal{T}_\varepsilon$ denote the
set of all estimators $\hat\theta_\varepsilon$ of $\theta$ measurable with respect to $(x_k, x_k^*, k = 1, 2, \ldots)$ and
satisfying the inequalities


(48) — sup supo, —6) I()]S1--—— and [6,| <1. 
fEFs, (f)8e9 H 


It is enough to restrict our attention to the estimators from $\mathcal{T}_\varepsilon$, since for estimators
that do not satisfy one of the inequalities in (48) the lower bound of Theorem 2 is
evident.


Clearly, 
(49) sup L Eo, fils — 6)? 1*(f)1x (0) dé < 1-- —— P = VÊ € Te. 
fen, (99 IF’ 
We have 
inf sup Eg, fiÊ— 0) ICA] 
6€7; 9, feFs(f) 
= inf E[(6 — 9 Pf) pn Aad] 
G0) = inf E[(6 — 0? I*15,( 5 ()] sup ERÊ — 6)? (I* — I*C)15 5] 
€Jg GET, 
> inf E[(6 — 0)? I*] — o(e*) — sup E[(6 — 6)*(1* — I*C)15 0] 
GET; 6ETs 
> inf E[(6 — 0) 1*] — o(e*) — sup E[@ — 0) (I* — 1° (f)) Ip AAI, 
c3; 


Where we have used the inequality 


Sup E[(8 — 9 Aj s] < Ce ^ exp( -Cy?W,) = o(e?), 
GET; 


which is a direct consequence of the estimates || < 1, ol < 1/4, IE € Ce ?, 
relation (27) and Lemma 14. The last term in (50) can be represented as 


E[(8 — 6) (I* — I*CF))1 AC) 


E 
ES - [o rm) n 
E(( — 0)^1*Cf) — (1 — I*/I*C))1 cs C). 


Due to Lemmas 13 and 14, the second term on the right-hand side of (51) is as- 
ymptotically negligible with respect to &? Y», Qz k} Ag = r*(1 + o(1)) [cf. (47)]. 




To evaluate the first term, note that 
EÊ — 6)^ I*Cf) — 11(1 — I*/I*)1 e cs C) 


(52) < sup | i E», siô -6Y IQ) - 1 (0) d6 E(1 — */1* G0) 
ferc 


«Ce? sup | Í. Eg, FÊ — 0) 1*(f)]x (0) do — 1 IEUS — 2500) 7] 


feFiCf) 


It follows from (58) and Lemma 11 that e*E(/® — J*(f))* is o(1). Now, 
Lemma 12, inequality (49) and the fact that SUP rc p, (f) In /I*(f) € Ce? = o(r*) 
[cf. (58)] imply that 


(53) sup | p Eo, (Ge — 6)? 1*(f)] (8) dé — ! «Cr VEG. 


feFs(f) 
Plugging (51)–(53) in (50) and using Lemmas 10, 13 and 14, we get


inf — sup — Ee[( — 0^ I* (/)] z inf EL — 6^1] + or^) 
6€7; 968, f eFs(f) ó 


1 OO 
Bd T 2 Orb +o(r®) 
r* 
WA? 
where for the last equality we have used (47) and the fact that, due to (28) and (46), 
-IFs Y ork og 
k ys We 


«c V  Qnzk) (W,/ k) 9! 
Ve Ws Sk x W, 


< Ce Wyl? = o(1), 


= | + + o(r*), 





as £ — 0, 


9. Proof of Theorem 3. It is enough to check that Assumptions B and C are
satisfied for $h_k = \lambda_k^*$ and that (34) holds. We first check Assumption C. Recall that
we supposed w.l.o.g. that $\gamma_\varepsilon < 1$. Then $1 - \lambda_k^* \ge \gamma_\varepsilon^{\beta-1}$ for $k > \gamma_\varepsilon W_\varepsilon$, and we have


$.0-ADQzkff?- , 0—ADQzl f? 


k=] k ys We 


OO 
xy PY ap? ky fg x v] Pr’. 
k=l 




This and (30) show that Assumption C is satisfied for $h_k = \lambda_k^*$. Using (27) we find
that Assumption B also holds. Indeed, Assumption B2 amounts to checking that
$\varepsilon^2 W_\varepsilon^5 \le C_1$, which is clearly the case for $\beta > 1$, whereas Assumption B1 follows
from the relation $\sqrt{W_\varepsilon}/\log^2(\varepsilon^{-2}) \to +\infty$, as $\varepsilon \to 0$. Now we are ready to check
(34). For any $\kappa \in [0, 1]$ one obtains [recall that the $q_k$'s are defined by (28)]


sup RISA sup JOALA Onk k + vw) 


feFs C) ve W(B,L) k=] 
oo 
+ Y (erki) 
k=l 
(54) > 00 
€ (1-- )r* + m 2 0- AR)? Qx b) f? 
ki 
+e? Y 0—-qpQOxEb*, 


k Xy; We 
where for the last inequality we have used (29). From (36) we obtain 
oo 
Ca- R YO Ork? f zCQ WW 7 —o(r). 
k=l k ys Ws 
Note also that due to the relations $r^\varepsilon \ge C\varepsilon^2 W_\varepsilon^3$ [cf. (30)] and $\gamma_\varepsilon \to 0$ we get


pP 
& M ak- Drk x28 Y (sz) (2x k)* < Ce*yP*? We — o(r*). 
k «y; Ws kXys We d 


These inequalities and (54) prove (34), since $\kappa$ can be arbitrarily small.
APPENDIX 


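PROOF OF LEMMA 10. We start by applying the Van Trees inequality ([33];
see also [7]):

$$
\inf_{\hat\theta_\varepsilon} E\big[(\hat\theta_\varepsilon - \theta)^2\big] \ge \left(\int_\Theta J^\varepsilon(\theta)\pi(\theta)\,d\theta + I_\pi\right)^{-1}, \tag{55}
$$

where $J^\varepsilon(\theta)$ is the Fisher information on $\theta$ contained in the observations $(x_k, x_k^*,
k = 1, 2, \ldots)$ for the Bayes model with fixed $\theta$. Since these observations are independent,
$J^\varepsilon(\theta)$ is the sum over $k$ of the Fisher information of pairs $(x_k, x_k^*)$. So
using Lemma 9 we get that $J^\varepsilon(\theta)$ does not depend on $\theta$ and equals

$$
J^\varepsilon = \varepsilon^{-2}\sum_{k=1}^{\infty}(2\pi k)^2(\bar f_k^2 + \lambda_k\sigma_k^2) = \bar I^\varepsilon - \sum_{k=1}^{\infty}(2\pi k)^2\lambda_k.
$$

Therefore

$$
\bar I^\varepsilon\left(\int_\Theta J^\varepsilon(\theta)\pi(\theta)\,d\theta + I_\pi\right)^{-1} \ge 1 + \frac{1}{\bar I^\varepsilon}\sum_{k=1}^{\infty}(2\pi k)^2\lambda_k - \frac{I_\pi}{\bar I^\varepsilon}. \tag{56}
$$

To complete the proof, it is enough to remark that, in view of (24),

$$
\bar I^\varepsilon \ge \varepsilon^{-2}\|\bar f'\|^2 \ge \varepsilon^{-2}(2\pi)^2\rho. \qquad \square \tag{57}
$$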

PROOF OF LEMMA 11. Using the independence of the f,’s for different val- 
ues of k, we get 


ef | (1° (f) - PY wf) = Yank | U2 — Fe — od? We (df) 
k=] 
= VY'Qnk)* (4fgog +202) 
k—l 


OO 
< iur supo + d 
k k=] 
The assertion of the lemma follows now from (45). □


PROOF OF LEMMA 13. First note that using Assumption A2 one obtains 


(58) & I*(f) > (20)? f? > Ory (fd — 8%) > (22)*p/2, 
for any f € F3(f) and for any ô < ./p/2. Furthermore, by (24), 
(59) &I'(f)z20f'I^--L)s2(Co-L) Vf EFSF). 


The elementary identity 1 — y = y7! — 1 — y(1 — y~!)* yields 


MODE 


£ 2 fe 
» La g 1) Pa Mon 


To estimate the first integral on the right-hand side we note that 75 = f I*(f) x 
V, (df); therefore using (57) and (59) we get 


I. (Fe a 1) ¥o(af)} = Lock UN 1) ¥o (as) 


< CP(f € Fs(f)), 
where F£ (f) = £7 \ Fs(f). Finally, due to (57) and (58), 


rq) yr "Mp 
Lat I* )) I*(f) V. (df) x Ce fc (f) I*) V, (df). 























The rest follows from Lemma 11. □


PROOF OF LEMMA 14. Let $\eta_k$ be i.i.d. $\mathcal{N}(0, 1)$ random variables. We have 


P(fé Fs(f)} < i Y qu. d 
k ys Ws 
(60) 





L 
i Y. Onk)” gs; > = 


k> y; Ws € | 
We use Lemma 2 in order to evaluate the second probability. Note that 


2 (27k) Pst < Cet wert! 
k>Ys We 
and max, s; 2Ozk)?? «ce? we d Therefore, by Lemma 2, for any x < C./W, we 
have 
i Y. (2k)? (ng — sz > seran) < exp( -Cx?). 
k> Ys Ws 

À » weAti/2 "- 

pplying this inequality for x = y; L/&^W [note that in view of (27) x is 
less than C./W, ], and using the fact that SW (27k)? s? < Ce? Wz? ya P = 
O(ye), one obtains 


P| Y; Ork P ist > E < P| Y. (2uk)?? (ng — Dsk > a 


k> Yes Ws k>y_ We 


< WR e E 
«e| SIE exp( C Cy Ws). 


The first probability on the right-hand side of (60) can be estimated similarly. 
We have 5». We si < < Ce*Weye *B+3 and IaX-. y W, sz < Ce?ye ee . Hence, 
by Lemma 2, for any x < CV Ys Wes, 


i b (nz — 15? > xg^ ga rem < exp(—Cx?). 
k7 yc We 
So with x = C4/ Ye We, noting that 2 s. sj < < e Waye E , one obtains 


i . ris > | = i Y m- D>- Ye d 


Kk ys We k> ys Ws k>y Ws 


< P| > (n? — 1)52 > coy et 
k>ys Ws 


< exp(—C ys We). E] 


REFERENCES 

[1] BELITSER, E. N. and LEVIT, B. YA. (1995). On minimax filtering over ellipsoids. Math. 
Methods Statist. 4 259-273. MR1355248 

[2] BICKEL, P. J. (1982). On adaptive estimation. Ann. Statist. 10 647-671. MR0663424 

[3] BICKEL, P. J., KLAASSEN, C. A. J., RITOV, Y. and WELLNER, J. A. (1998). Efficient and 
Adaptive Estimation for Semiparametric Models. Springer, New York. MR1623559 

[4] CAVALIER, L., GOLUBEV, G. K., PICARD, D. and TSYBAKOV, A. B. (2002). Oracle 
inequalities for inverse problems. Ann. Statist. 30 843-874. MR1922543 

[5] CRAMÉR, H. and LEADBETTER, M. R. (1967). Stationary and Related Stochastic Processes. 
Wiley, New York. MR0217860 

[6] DALALYAN, A. S. and KUTOYANTS, YU. A. (2004). On second order minimax estimation of 
invariant density for ergodic diffusion. Statist. Decisions 22 17-41. MR2065989 

[7] GILL, R. D. and LEVIT, B. YA. (1995). Applications of the Van Trees inequality. A Bayesian 
Cramér-Rao bound. Bernoulli 1 59-79. MR1354456 

[8] GOLUBEV, G. K. (1990). On estimation of the time delay of a signal under nuisance 
parameters. Problems Inform. Transmission 25 173-180. MR1021195 

[9] GOLUBEV, G. and HÄRDLE, W. (2000). Second order minimax estimation in partial linear 
models. Math. Methods Statist. 9 160-175. MR1780752 

[10] GOLUBEV, G. and HÄRDLE, W. (2002). On adaptive smoothing in partial linear models. Math. 
Methods Statist. 11 98-117. MR1900975 

[11] GOLUBEV, G. K. and LEVIT, B. YA. (1996). On the second order minimax estimation of 
distribution functions. Math. Methods Statist. 5 1-31. MR1386823 

[12] GOLUBEV, G. K. and LEVIT, B. YA. (1996). Asymptotically efficient estimation for analytic 
distributions. Math. Methods Statist. 5 357-368. MR1417678 

[13] HALLIN, M. and WERKER, B. J. M. (2003). Semi-parametric efficiency, distribution-freeness 
and invariance. Bernoulli 9 137-165. MR1963675 

[14] HÄRDLE, W. and MARRON, J. S. (1990). Semiparametric comparison of regression curves. 
Ann. Statist. 18 63-89. MR1041386 

[15] HÄRDLE, W. and TSYBAKOV, A. B. (1993). How sensitive are average derivatives? 
J. Econometrics 58 31-48. MR1230979 

[16] HIDA, T. (1980). Brownian Motion. Springer, New York. MR0562914 

[17] IBRAGIMOV, I. A. and HASMINSKII, R. Z. (1981). Statistical Estimation. Asymptotic Theory. 
Springer, New York. MR0620321 

[18] LEADBETTER, M. R., LINDGREN, G. and ROOTZÉN, H. (1986). Extremes and Related 
Properties of Random Sequences and Processes. Springer, New York. MR0691492 

[19] LE CAM, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York. 
MR0856411 

[20] LEVIT, B. YA. (1985). Second-order asymptotic optimality and positive solutions of the 
Schrödinger equation. Theory Probab. Appl. 30 333-363. MR0792622 

[21] MAMMEN, E. and PARK, B. (1997). Optimal smoothing in adaptive location estimation. 
J. Statist. Plann. Inference 58 333-348. MR1450020 

[22] PFANZAGL, J. (1990). Estimation in Semiparametric Models. Lecture Notes in Statist. 63. 
Springer, New York. MR1048589 

[23] PINSKER, M. S. (1980). Optimal filtering of square integrable signals in Gaussian white noise. 
Problems Inform. Transmission 16 120-133. MR0624591 

[24] SCHICK, A. (1986). On asymptotically efficient estimation in semiparametric models. Ann. 
Statist. 14 1139-1151. MR0856811 

[25] SEVERINI, T. A. and WONG, W. H. (1992). Profile likelihood and conditionally parametric 
models. Ann. Statist. 20 1768-1802. MR1193312 




[26] STEIN, C. (1956). Efficient nonparametric testing and estimation. Proc. Third Berkeley Symp. 
Math. Statist. Probab. 1 187-195. Univ. California Press, Berkeley. MR0084921 

[27] STONE, C. J. (1975). Adaptive maximum likelihood estimators of a location parameter. Ann. 
Statist. 3 267-284. MR0362669 

[28] TSYBAKOV, A. B. (1998). Pointwise and sup-norm sharp adaptive estimation of functions on 
the Sobolev classes. Ann. Statist. 26 2420-2469. MR1700239 

[29] TSYBAKOV, A. B. (2004). Introduction à l'estimation non-paramétrique. Springer, Berlin. 
MR2013911 

[30] VAN DER LAAN, M. J. (1995). Efficient and Inefficient Estimation in Semiparametric Models. 
CWI, Amsterdam. MR1376813 

[31] VAN DER VAART, A. (1996). Efficient maximum likelihood estimation in semiparametric 
mixture models. Ann. Statist. 24 862-878. MR1394993 

[32] VAN DER VAART, A. (1998). Asymptotic Statistics. Cambridge Univ. Press. MR1652247 

[33] VAN TREES, H. L. (1968). Detection, Estimation and Modulation Theory 1. Wiley, New York. 


A. S. DALALYAN 
A. B. TSYBAKOV 
UNIVERSITÉ PARIS VI 
2 PLACE JUSSIEU, CASE 188 
75252 PARIS CEDEX 05 
FRANCE 
E-MAIL: dalalyan@ccr.jussieu.fr 
tsybakov@ccr.jussieu.fr 

G. K. GOLUBEV 
UNIVERSITÉ AIX-MARSEILLE 1 
39 RUE F. JOLIOT-CURIE 
13453 MARSEILLE 
FRANCE 
E-MAIL: golubev@gyptis.univ-mrs.fr 


The Annals of Statistics 
2006, Vol. 34, No. 1, 202-228 
DOI: 10.1214/009053606000000146 
© Institute of Mathematical Statistics, 2006 


ADAPTIVE CONFIDENCE BALLS¹ 


By T. TONY CAI AND MARK G. LOW 
University of Pennsylvania 


Adaptive confidence balls are constructed for individual resolution levels 
as well as the entire mean vector in a multiresolution framework. Finite 
sample lower bounds are given for the minimum expected squared radius for 
confidence balls with a prespecified confidence level. The confidence balls 
are centered on adaptive estimators based on special local block thresholding 
rules. The radius is derived from an analysis of the loss of this adaptive 
estimator. In addition, adaptive honest confidence balls are constructed which 
have guaranteed coverage probability over all of $\mathbb{R}^N$ and expected squared 
radius adapting over a maximum range of Besov bodies. 


1. Introduction. A central goal in nonparametric function estimation, and 
one which has been the focus of much attention in the statistics literature, is the 
construction of adaptive estimators. Informally, an adaptive procedure automati- 
cally adjusts to the smoothness properties of the underlying function. A common 
way to evaluate such a procedure is to compute its maximum risk over a collection 
of parameter spaces and to compare these values to the minimax risk over each of 
them. 

It should be stressed that such adaptive estimators do not provide a data- 
dependent estimate of the loss, nor do they immediately yield easily constructed 
adaptive confidence sets. Such confidence sets should have size which adapts to the 
smoothness of the underlying function while maintaining a prespecified coverage 
probability over a given function space. Moreover, it is clearly desirable to center 
such confidence sets on estimators which possess other strong optimality proper- 
ties. In the present paper, a confidence ball is constructed centered on a special 
block thresholding rule which has particularly good spatial adaptivity. The radius 
is built upon good estimates of loss. 

We focus on a sequence of statistical models commonly used in the adaptive 
estimation literature, namely, a multivariate normal model with mean vector cor- 
responding to wavelet coefficients. More specifically, consider the models 


(1) $y_{j,k} = \theta_{j,k} + \frac{1}{\sqrt{n}}\, z_{j,k}, \qquad j = 0, 1, \ldots, J-1, \quad k = 1, \ldots, 2^j,$ 


Received May 2004; revised April 2005. 
¹Supported in part by NSF Grant DMS-03-06576. 
AMS 2000 subject classifications. Primary 62G99; secondary 62F12, 62F35, 62M99. 
Key words and phrases. Adaptive confidence balls, Besov body, block thresholding, coverage 
probability, expected squared radius, loss estimation. 




where $z_{j,k} \stackrel{\text{i.i.d.}}{\sim} N(0,1)$ and where it is assumed that $N$ is a function of $n$, 
$2^J - 1 = N$ and that the mean vector $\theta$ lies in a parameter space $\Theta$. In the present 
work, confidence balls are constructed over collections of Besov bodies 


(2) $B^\beta_{p,q}(M) = \Bigl\{\theta : \Bigl(\sum_{j=0}^{J-1} \Bigl(2^{js} \Bigl(\sum_{k=1}^{2^j} |\theta_{j,k}|^p\Bigr)^{1/p}\Bigr)^q\Bigr)^{1/q} \le M\Bigr\},$ 

where $s = \beta + \frac12 - \frac1p > 0$ and $p \ge 2$. In particular, these spaces contain as 
special cases a number of traditional smoothness classes such as Sobolev and Hölder 
spaces. Although not needed for the development given in this paper, it may be 
helpful to think of the $\theta_{j,k}$ as wavelet coefficients of a regression function $f$. 
A confidence ball for the vector $\theta$ then yields a corresponding confidence ball 
for the regression function $f$. See, for example, [8], where such an approach is 
taken. Based on the model (1), we introduce new estimates of the loss of block 
thresholding estimators and use these estimates to construct confidence balls. 
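
As a small illustration of the Besov body in (2), the following sketch computes the sequence norm level by level and checks membership. It assumes the reading $s = \beta + 1/2 - 1/p$ of the smoothness index; the function names `besov_seq_norm` and `in_besov_body` and the toy coefficient array are hypothetical and not part of the paper.

```python
def besov_seq_norm(theta, beta, p, q):
    """Sequence norm appearing in the definition of the Besov body.

    theta[j] is the list of the 2**j coefficients at resolution level j.
    Assumes s = beta + 1/2 - 1/p, as read from the text.
    """
    s = beta + 0.5 - 1.0 / p
    total = 0.0
    for j, level in enumerate(theta):
        # l_p norm of the coefficients at level j
        lp_norm = sum(abs(t) ** p for t in level) ** (1.0 / p)
        total += (2.0 ** (j * s) * lp_norm) ** q
    return total ** (1.0 / q)


def in_besov_body(theta, beta, p, q, M):
    """True if theta lies in the Besov body of radius M."""
    return besov_seq_norm(theta, beta, p, q) <= M


# Toy example with three resolution levels (illustrative values only).
theta = [[1.0], [0.5, -0.5], [0.25, 0.0, 0.0, -0.25]]
norm = besov_seq_norm(theta, beta=1.0, p=2, q=2)
```

With $p = q = 2$ and $\beta = 1$ the three levels contribute 1, 2 and 2 to the squared norm, so the norm is $\sqrt{5}$ and the vector belongs to the body exactly when $M \ge \sqrt{5}$.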

In the context of confidence balls, adaptation over a general collection of 
parameter spaces $\mathcal{C} = \{\Theta_i : i \in I\}$, where $I$ is an index set, can be made precise as 
follows. An adaptive confidence ball guarantees a given coverage probability over 
the union of these spaces while simultaneously minimizing the maximum expected 
squared radius over each of the parameter spaces. Write $\mathcal{B}_{\alpha,\Theta}$ for the collection 
of all confidence balls which have coverage probability of at least $1 - \alpha$ over $\Theta$. 
Write $r^2(CB, \Theta)$ for the maximum expected squared radius of a confidence ball $CB$ 
over $\Theta$ and $r_\alpha^2(\Theta)$ for the minimax expected squared radius over confidence balls 
in $\mathcal{B}_{\alpha,\Theta}$. Then $r_\alpha^2(\Theta)$ is the smallest maximum expected squared radius of 
confidence balls with guaranteed coverage over $\Theta$. Adaptation over the collection $\mathcal{C}$ 
can then be defined as follows. Let $\Theta_I = \bigcup_{i \in I} \Theta_i$. A confidence ball $CB \in \mathcal{B}_{\alpha,\Theta_I}$ 
is called adaptive over $\mathcal{C}$ if for all $i \in I$, $r^2(CB, \Theta_i) \le C_i\, r_\alpha^2(\Theta_i)$, where $C_i$ are 
constants not depending on $n$, and we say that adaptation is possible over $\mathcal{C}$ if such a 
procedure exists. 

In a multivariate normal setup as given in the model (1) with $N = n$, Li [11] 
constructs adaptive confidence balls for the mean vector which have a given 
coverage over all of $\mathbb{R}^N$. It was shown that under this constraint the squared radius 
of the ball must, with high probability, be bounded from below by $cn^{-1/2}$ for all 
choices of the unknown mean vector. Moreover a confidence ball was constructed 
centered on a shrinkage estimator which attains this lower bound at least for some 
subsets of $\mathbb{R}^N$. 

Hoffmann and Lepski [9] introduce the concept of a random normalizing factor 
into the study of nonparametric function estimation and use this idea to construct 
asymptotic confidence balls which adapt over a collection of finitely many 
parameter spaces. In particular, their results can be used to yield asymptotic confidence 
balls which adapt over a finite number of Sobolev bodies. Baraud [1] is a further 
development of both Li [11] and Hoffmann and Lepski [9] concentrating on 
confidence balls which perform well over a finite family of linear subspaces. An honest 
confidence ball over $\mathbb{R}^N$ was constructed such that the radius adapts with high 
probability to a given collection of subspaces. 

Juditsky and Lambert-Lacroix [10] develop adaptive $L_2$ confidence balls for a 
function f in a nonparametric regression setup with equally spaced design. The 
paper used unbiased estimates of risk to construct minimax rate adaptive proce- 
dures over Besov spaces. It focused on the asymptotic performance and detailed 
finite sample results were not given. Robins and van der Vaart [12] use sample 
splitting to divide the construction of the center and radius of a confidence ball 
into independent problems and show how to use estimates of quadratic functionals 
to construct adaptive confidence balls. 

In the present paper the focus is on finite sample properties of adaptive confi- 
dence balls centered on a special local block thresholding estimator known to have 
strong adaptivity under mean integrated squared error. The radius is derived from 
an analysis of the loss of this adaptive estimator. The evaluation of the performance 
of the resulting confidence ball relies on a detailed understanding of the interplay 
between these two estimates. Three cases of interest are considered in detail. We 
first construct confidence balls for the mean vector at individual resolution levels. 
Then adaptive confidence balls are constructed for all N coefficients over Besov 
bodies. Finally we consider honest confidence balls over all of $\mathbb{R}^N$ and expected 
squared radius adapting over a maximum range of Besov bodies. 

The paper is organized as follows. Section 2 is focused on constructing confi- 
dence balls for the mean vector associated with a single resolution level j in the 
Gaussian model (1). These confidence balls can be used in a multiresolution study. 
Finite sample lower bounds are given for the expected squared radius of confidence 
balls which have a prescribed minimum coverage level over a given Besov body. 
Bounds are given for the maximum expected squared radius as well as when the 
mean vector is equal to zero. Confidence balls which have an expected squared ra- 
dius within a constant factor of both these lower bounds are constructed. We show 
that the problem is degenerate over a certain range of Besov bodies beyond which 
full adaptation is possible. Adaptive confidence balls are constructed centered on 
a block thresholding estimator. The results and ideas given in this section are used 
as building blocks in the analysis and construction of adaptive confidence balls for 
all N coefficients in Sections 3 and 4. 

The focus of Section 3 is on the construction and analysis of confidence balls 
with a specified minimal coverage probability over a given Besov body $B^\beta_{p,q}(M)$. 
It is shown that the possible range of adaptation depends on the relationship 
between the dimension N and the noise level. Adaptive confidence balls are 
constructed over a maximal range of Besov bodies. These results are markedly 
different from the bounds derived for adaptive estimation or adaptive confidence 
intervals. 


In Section 4 confidence balls are constructed which have guaranteed coverage 
probability over all of $\mathbb{R}^N$. This procedure has a number of strong optimality 
properties. It adapts over a maximal range of Besov bodies over which honest 
confidence balls can adapt. Moreover, given that the confidence ball has a prespecified 
coverage probability over $\mathbb{R}^N$, it has maximum expected squared radius within a 
constant factor of the smallest maximum expected squared radius for all Besov 
bodies $B^\beta_{p,q}(M)$ with $\beta > 0$ and $M \ge 1$. 

Proofs are given in Section 5. 


2. Adaptive confidence balls for a single resolution level. As mentioned in 
the Introduction, the mean $\theta_{j,k}$ in the model (1) can be thought of as the $k$th 
coefficient at level $j$ in a wavelet expansion of a function $f$. The different levels $j$ 
allow for a multiresolution analysis where the coefficients with small values of $j$ 
correspond to coarse features and where the coefficients with large values of $j$ 
correspond to fine features. In this section we first fix a level $j$ and focus not only on 
estimating the sequence of means at that level but also on constructing honest 
confidence balls for this set of coefficients. 

Confidence balls are constructed which maintain coverage no matter the values 
of $\theta_{j,k}$ and have an expected radius adapting to these coefficients over a range of 
Besov bodies. The analysis given in this section also provides insight (as is shown 
in Sections 3 and 4) into the problem of estimating all the wavelet coefficients 
across different levels. 

In the following analysis, for a given level $j$, write $\theta_j$ for the sequence of mean 
values at this given resolution level. That is, $\theta_j = \{\theta_{j,k} : k = 1, \ldots, 2^j\}$. The 
analysis can then naturally be divided into two parts. We start with lower bounds for the 
expected squared radius of confidence balls which have a given coverage probability 
over a given Besov body. Two lower bounds are given. One is for the expected 
squared radius when all the coefficients are zero. The other is for the maximum 
expected squared radius. Set $z_\alpha = \Phi^{-1}(1 - \alpha)$, where $\Phi$ is the cumulative 
distribution function of a standard Normal random variable. 
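
For concreteness, the quantile $z_\alpha = \Phi^{-1}(1 - \alpha)$ can be evaluated with the Python standard library. This is only a minimal sketch; the helper name `z_alpha` is hypothetical.

```python
from statistics import NormalDist


def z_alpha(alpha):
    """Upper-alpha quantile of the standard normal, Phi^{-1}(1 - alpha)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return NormalDist().inv_cdf(1.0 - alpha)


z_05 = z_alpha(0.05)  # quantile used for a 95% one-sided statement
```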


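THEOREM 1. Fix $0 < \alpha < \frac12$ and let $CB(\delta, r_\alpha) = \{\theta_j : \|\theta_j - \delta\|_2 \le r_\alpha\}$ be a 
confidence ball for $\theta_j$ with random radius $r_\alpha$ which has a guaranteed coverage 
probability over $B^\beta_{p,q}(M)$ of at least $1 - \alpha$. Then for any $0 < \epsilon < \frac12(\frac12 - \alpha)$, 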

(3) sup Eg G2 —— min(M*2~*PY, 2^ |. 2/71). 
9€B5 (M) 

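Moreover, for any $0 < \epsilon < \frac12 - \alpha$, 

(4) $E_0(r_\alpha^2) \ge (1 - 2\alpha - 2\epsilon) \min\bigl(M^2 2^{-2\beta j},\ \log^{1/2}(1 + \epsilon^2)\, 2^{j/2} n^{-1}\bigr),$ 

where $E_0$ denotes expectation under $\theta = 0$. 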




It is useful to note that the maximum value of $\sum_k \theta_{j,k}^2$ at a given level $j$ 
over the Besov body $B^\beta_{p,q}(M)$ is $M^2 2^{-2\beta j}$. Hence, from (4), if $M^2 2^{-2\beta j} \le 
\log^{1/2}(1 + \epsilon^2)\, 2^{j/2} n^{-1}$ the lower bound for the expected squared radius when 
the mean vector is equal to zero is a constant multiple of $M^2 2^{-2\beta j}$. It follows 
that if a given coverage probability is guaranteed over $B^\beta_{p,q}(M)$ then the 
maximum expected squared radius over any other Besov body must also be of this 
same order. It should be stressed that this is really a degenerate case since the 
trivial ball centered at zero with squared radius equal to $M^2 2^{-2\beta j}$ is within a 
constant factor of the lower bounds given in (3) and (4) and has coverage 
probability equal to one. Thus we shall focus only on the construction of confidence 
balls which have a given coverage probability at least over Besov bodies where 
$M^2 2^{-2\beta j} > \log^{1/2}(1 + \epsilon^2)\, 2^{j/2} n^{-1}$. In particular, we only need to consider 
resolution levels $j = j_n$ satisfying $2^j \le n^2$ since resolution levels with $2^j > n^2$ satisfy 
$M^2 2^{-2\beta j} < \log^{1/2}(1 + \epsilon^2)\, 2^{j/2} n^{-1}$ at least for large $n$. Moreover, since little is to 
be gained for levels where $2^j < \log n$ by using confidence balls with random 
radius, in such cases we shall just use the usual $100(1 - \alpha)\%$ confidence ball centered 
on the observations $y_{j,k}$. Thus in the following construction attention is focused 
on cases where $\log n \le 2^j \le n^2$. 

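As mentioned in the Introduction, the center of the ball is constructed by local 
thresholding. Set $L = \log n$ and let $B_i^j = \{(j,k) : (i-1)L + 1 \le k \le iL\}$, $1 \le i \le 
2^j/L$, denote the set of indices of the coefficients in the $i$th block at level $j$. For a 
given block $B_i^j$, set 

(5) $S_{j,i}^2 = \sum_{(j,k) \in B_i^j} y_{j,k}^2, \qquad \xi_{j,i}^2 = \sum_{(j,k) \in B_i^j} \theta_{j,k}^2, \qquad \chi_{j,i}^2 = \sum_{(j,k) \in B_i^j} z_{j,k}^2.$ 

Let $\lambda_* = 6.9368$ be the root of the equation $\lambda - \log\lambda = 5$. This threshold is similar 
to the one used in [4, 5]. Then the center $\hat\theta_j = (\hat\theta_{j,k})$ is defined by 

(6) $\hat\theta_{j,k} = y_{j,k}\, I(S_{j,i}^2 > \lambda_* L n^{-1}).$ 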
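
The quoted threshold constant can be checked numerically. The sketch below uses plain bisection to locate the root larger than 1 of $\lambda - \log\lambda = 5$, reading the equation as stated in the text with a natural logarithm; the function name `solve_lambda_star` and the bracketing interval are assumptions made for illustration.

```python
import math


def solve_lambda_star(c=5.0, lo=1.0, hi=20.0, tol=1e-10):
    """Bisection for the root > 1 of lam - log(lam) = c (natural log)."""
    def f(lam):
        return lam - math.log(lam) - c

    # f is increasing on (1, infinity), negative at lo and positive at hi.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


lambda_star = solve_lambda_star()
```

The computed root agrees with the value 6.9368 given in the text to the displayed precision.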

It follows from [5] that this local block thresholding rule has strong adaptivity 
under both global and local risk measures. We now show how the loss $\|\hat\theta_j - \theta_j\|_2^2$ 
of this estimator can be estimated and used in the construction of the radius of the 
confidence ball. Note that $\hat\theta_{j,k}$ equals either $0$ or $y_{j,k}$ and hence the loss can be 
broken into two terms, 

(7) $\sum_k (\hat\theta_{j,k} - \theta_{j,k})^2 = \sum_i \xi_{j,i}^2\, I(S_{j,i}^2 \le \lambda_* L n^{-1}) + n^{-1} \sum_i \chi_{j,i}^2\, I(S_{j,i}^2 > \lambda_* L n^{-1}).$ 

The first term can be handled by using an estimate of a quadratic functional. The 
other term can be analyzed using the fact that $\chi_{j,i}^2$ has a central chi-squared 
distribution. 
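
The block thresholding rule and the two-term split of the loss can be illustrated on simulated data. The following is a minimal sketch under stated assumptions, not the authors' implementation: the block length is taken as the integer part of $\log n$, a final shorter block is allowed when the level size is not a multiple of the block length, and the function name `block_threshold_level` and the toy mean vector are hypothetical.

```python
import math
import random

LAMBDA_STAR = 6.9368  # threshold constant quoted in the text


def block_threshold_level(y, n, lam=LAMBDA_STAR):
    """Blockwise keep-or-kill rule at one resolution level.

    y : observations at the level, n : noise level parameter.
    Returns (estimate, kept), where kept[i] says whether block i survived.
    """
    L = max(1, int(math.log(n)))      # integer block length (an assumption)
    threshold = lam * L / n
    estimate, kept = [], []
    for start in range(0, len(y), L):
        block = y[start:start + L]    # the last block may be shorter
        s2 = sum(v * v for v in block)
        keep = s2 > threshold
        kept.append(keep)
        estimate.extend(block if keep else [0.0] * len(block))
    return estimate, kept


# Hypothetical example: level j = 6, n = 1024, a few large coefficients.
random.seed(0)
n, j = 1024, 6
L = max(1, int(math.log(n)))
theta = [0.5 if k < 8 else 0.0 for k in range(2 ** j)]
z = [random.gauss(0.0, 1.0) for _ in theta]
y = [t + zk / math.sqrt(n) for t, zk in zip(theta, z)]

estimate, kept = block_threshold_level(y, n)
loss = sum((e - t) ** 2 for e, t in zip(estimate, theta))

# Two-term split of the loss: killed blocks contribute their signal energy,
# kept blocks contribute their noise energy divided by n.
bias_part = 0.0
noise_part = 0.0
for i, keep in enumerate(kept):
    idx = range(i * L, min((i + 1) * L, len(y)))
    if keep:
        noise_part += sum(z[k] ** 2 for k in idx) / n
    else:
        bias_part += sum(theta[k] ** 2 for k in idx)
```

Up to floating-point rounding, `loss` equals `bias_part + noise_part`, which is the decomposition into a quadratic-functional term and a chi-squared term described above.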




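Let $(x)_+$ denote $\max(0, x)$ and set 

(8) $r_\alpha^2 = \bigl[2\log^{1/2}(2/\alpha) + 4\lambda_*^{1/2}\bigr] 2^{j/2} n^{-1} + \Bigl(\sum_i (S_{j,i}^2 - L n^{-1})\, I(S_{j,i}^2 \le \lambda_* L n^{-1})\Bigr)_+ + (2\lambda_* + 8\lambda_*^{1/2} - 1) L n^{-1}\, \mathrm{Card}\{i : S_{j,i}^2 > \lambda_* L n^{-1}\}.$ 

The confidence ball is then defined as 

(9) $CB_*(\hat\theta_j, r_\alpha) = \{\theta_j : \|\theta_j - \hat\theta_j\|_2 \le r_\alpha\},$ 

where, when $2^j \ge \log n$, the center $\hat\theta_j$ is given as in (6) and the radius is given in (8), 
and where $\hat\theta_{j,k} = y_{j,k}$ and $r_\alpha$ is the radius of the usual $100(1 - \alpha)\%$ confidence ball 
when $2^j < \log n$. 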


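THEOREM 2. Let the confidence ball $CB_*(\hat\theta_j, r_\alpha)$ be given as in (9) and 
suppose that the resolution level $j$ satisfies $2^j \le n^2$. Then 

(10) $\inf_{\theta \in \mathbb{R}^N} P\bigl(\theta_j \in CB_*(\hat\theta_j, r_\alpha)\bigr) \ge 1 - \alpha - 2(\log n)^{-1},$ 

and for a constant $C_\beta$ depending only on $\beta$, 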


2 
sup E(r2) < |2 log! (2) 4A zu + 4 pia 
(11) 8e BP (M) 2 


+ Cg min(2/ n^, M?2~7FJ), 


Note that the confidence ball constructed above attains the minimax lower 
bound given in (3) simultaneously over all Besov bodies $B^\beta_{p,q}(M)$ with $M^2 2^{-2\beta j} \ge 
\log^{1/2}(1 + \epsilon^2)\, 2^{j/2} n^{-1}$. This is true even though the confidence ball has a given 
level of coverage for all $\theta$ in $\mathbb{R}^N$. 


3. Adaptive confidence balls over Besov bodies. The confidence balls 
constructed in Section 2 focused on a given resolution level. In this section this 
construction is extended to the more complicated case of estimating all $N$ 
coefficients of $\theta$. Specifically, we consider adaptation over a collection of Besov bodies 
$B^\beta_{p,q}(M)$ with $p \ge 2$. It should be stressed that the theory developed in this 
section for adaptive confidence balls is quite different from that of adaptive estimation 
theory where adaptation under global losses is possible over all Besov bodies. In 
particular, adaptation for confidence balls is only possible over a much smaller 
range of Besov bodies. 




In Section 3.1 a lower bound is given on both the maximum and the minimum 
expected squared radius for any confidence ball with a particular coverage prob- 
ability over a Besov body. As in Section 2, these lower bounds provide a funda- 
mental limit to the range of Besov bodies where adaptation is possible. Adaptive 
confidence balls are described in Section 3.2. They build on the construction given 
in Section 2. The center uses the special local block thresholding rule used in Sec- 
tion 2 up to a particular level and then estimates the remaining coordinates by 
zero. The radius is chosen based on an estimate of the loss of this block thresh- 
olding estimate. The analysis of the resulting confidence ball relies on a detailed 
understanding of the interplay between these two estimates. 


3.1. Lower bounds. Theorem 1 provides lower bounds for the expected 
squared radius of a confidence ball for the mean vector at a given resolution level 
with a given coverage over $B^\beta_{p,q}(M)$. In this section lower bounds are given for the 
expected squared radius for the whole mean vector for any confidence ball which 
has a given coverage probability over $B^\beta_{p,q}(M)$. There are two lower bounds, one 
for the maximum expected squared radius and one for the minimum expected 
squared radius. We shall show that these two lower bounds determine the range 
over which adaptation is possible. 


THEOREM 3. Fix $0 < \alpha < \frac12$ and let $CB(\delta, r_\alpha) = \{\theta : \|\theta - \delta\|_2 \le r_\alpha\}$ be a $1 - \alpha$ 
level confidence ball for $\theta \in B^\beta_{p,q}(M)$ with random radius $r_\alpha$. Then 
sup Eo (rZ) 
(12) 9€B5 ,(M) 
> — - Sii min(Nn^!, c, 4/0320 yp 0420 ,728/0-28)) 
For any $0 < \epsilon < \frac12 - \alpha$, set $\gamma = \log(1 + \epsilon^2)$. For $0 < M' < M$ set 


be = min(2- /Q(-48)—1,, B/0-H4B) (y — Mh)LOAB y 728/0348). 


(13) 
Ly UA NTAS-12). 


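Then for all $\theta \in B^\beta_{p,q}(M')$, 

(14) $P_\theta(r_\alpha > b_\epsilon) \ge 1 - 2\alpha - 2\epsilon$ 

and consequently 

(15) $\inf_{\theta \in B^\beta_{p,q}(M')} E_\theta(r_\alpha^2) \ge (1 - 2\alpha - 2\epsilon)\, b_\epsilon^2.$ 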
6€ B , (M) 


In fact, as is shown in the next section, both bounds are rate sharp in the sense 
that there are confidence balls with a given coverage probability over $B^\beta_{p,q}(M)$ 
which have expected squared radius within a constant factor of the lower bounds 
given in (12) and (15). There are two cases of interest, namely, when $N \ge n^2$ and 
$N < n^2$. First suppose that $N \ge n^2$ and fix a Besov body $B^\beta_{p,q}(M)$ over which it is 
assumed that the confidence ball has a given coverage probability. Then by (15) the 
minimum expected squared radius is at least of order $n^{-4\beta/(1+4\beta)}$. Since from (12) 
the minimax expected squared radius for confidence balls over $B^\tau_{p,q}(M)$ is of 
order $n^{-2\tau/(1+2\tau)}$, the confidence ball $CB(\delta, r)$ must have expected squared radius 
larger than the minimax expected squared radius over any Besov body $B^\tau_{p',q'}(M)$ 
whenever $\tau > 2\beta$ and $p' \ge 2$. Hence in this case it is impossible to adapt over any 
Besov body with smoothness index $\tau > 2\beta$. Consequently in this case there is a 
maximum range of Besov bodies over which full adaptation is possible. 

Now suppose that $N < n^2$ and that $N = n^\rho$ where $0 < \rho < 2$. In this case the 
possible range of adaptation depends on the value of $\rho$. Let $CB(\delta, r)$ be a 
confidence ball with guaranteed coverage probability over $B^\beta_{p,q}(M)$. First suppose that 
$\beta \ge \frac{1}{2\rho} - \frac14$. Then as above it is easy to check that the minimum expected squared 
radius is at least of order $n^{-4\beta/(1+4\beta)}$ and that it is impossible to adapt over Besov 
bodies with $\tau > 2\beta$. On the other hand, suppose that $\beta < \frac{1}{2\rho} - \frac14$. Then by (15), the 
minimum expected squared radius is at least of order $n^{\rho/2 - 1}$, which is the 
minimax rate of convergence for the squared radius over a Besov body with $\tau = \frac1\rho - \frac12$. 
Hence in this case it is impossible to adapt over any Besov body with smoothness 
index $\tau > \frac1\rho - \frac12$. 

In summary, for a confidence ball with a prespecified coverage probability over 
a Besov body $B^\beta_{p,q}(M)$ the maximum range of Besov bodies $B^\tau_{p',q'}(M)$ over which 
full adaptation is possible is given in Table 1. 
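
The boundary between the two regimes for $N = n^\rho$ can be recovered by comparing exponents. The short derivation below is a sketch that assumes, as the discussion above indicates, that the lower bound in (15) is of the order of the smaller of $n^{-4\beta/(1+4\beta)}$ and $N^{1/2} n^{-1}$.

```latex
% Which of the two candidate rates is the smaller one:
\[
n^{-4\beta/(1+4\beta)} \le N^{1/2} n^{-1} = n^{\rho/2 - 1}
\iff \frac{4\beta}{1+4\beta} \ge 1 - \frac{\rho}{2}
\iff 2\beta\rho \ge 1 - \frac{\rho}{2}
\iff \beta \ge \frac{1}{2\rho} - \frac{1}{4}.
\]
% Smoothness index whose minimax rate matches n^{\rho/2 - 1}:
\[
n^{-2\tau/(1+2\tau)} = n^{\rho/2 - 1}
\iff \frac{2\tau}{1+2\tau} = 1 - \frac{\rho}{2}
\iff \tau\rho = 1 - \frac{\rho}{2}
\iff \tau = \frac{1}{\rho} - \frac{1}{2}.
\]
```

At the boundary $\beta = \frac{1}{2\rho} - \frac14$ the two upper limits coincide, since $2\beta = \frac1\rho - \frac12$.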


3.2. Construction of adaptive confidence balls. In this section the focus is on 
confidence balls which have a given minimal coverage over a particular Besov 
body. Subject to this constraint, confidence balls are constructed which have ex- 
pected squared radius adapting across a range of Besov bodies. The resulting balls 
are shown to be adaptive over the maximal range of Besov bodies given in Ta- 
ble 1 for the first two cases summarized in the table. The third case is covered in 
Section 4. 


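TABLE 1 

$N$, $n$ and $\beta$                                                         Maximum range of adaptation 
$N \ge n^2$, all $\beta > 0$                                                 $\beta \le \tau \le 2\beta$ 
$N = n^\rho$ for $0 < \rho < 2$, $\beta \ge \frac{1}{2\rho} - \frac14$       $\beta \le \tau \le 2\beta$ 
$N = n^\rho$ for $0 < \rho < 2$, $0 < \beta < \frac{1}{2\rho} - \frac14$     $\beta \le \tau \le \frac1\rho - \frac12$ 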
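
The case distinction in Table 1 can be written as a small lookup. This is an illustrative sketch only: the function name `max_adaptation_range` is hypothetical, `rho=None` is a convention chosen here to encode the case $N \ge n^2$, and the boundary value of $\beta$ is assigned to the second row, where both rows give the same range.

```python
def max_adaptation_range(beta, rho=None):
    """Return (lower, upper) limits of the smoothness index tau over which
    full adaptation is possible, following the three rows of Table 1.

    rho=None encodes N >= n**2; otherwise N = n**rho with 0 < rho < 2.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if rho is None:
        return (beta, 2.0 * beta)
    if not 0.0 < rho < 2.0:
        raise ValueError("rho must lie in (0, 2)")
    cutoff = 1.0 / (2.0 * rho) - 0.25
    if beta >= cutoff:
        return (beta, 2.0 * beta)
    return (beta, 1.0 / rho - 0.5)


range_large_N = max_adaptation_range(1.0)          # N >= n^2
range_rough = max_adaptation_range(0.1, rho=1.0)   # N = n, small beta
```

For $N = n$ the cutoff is $\beta = 1/4$, so a ball with coverage over a body with $\beta = 0.1$ can adapt up to $\tau = 1/2$.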




The ball is centered on a local thresholding rule and the squared radius is based 
on an analysis of the loss of this thresholding rule. More specifically, for the 
center $\hat\theta$, let $J_1$ be the largest integer satisfying 


(16) avi < min(N, MAINEAR) p INUEABIY 


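For all $j \ge J_1$, set $\hat\theta_{j,k} = 0$ and for $j \le J_1 - 1$ let $\hat\theta_{j,k}$ be the local thresholding 
estimator given in (6). The radius is found by analyzing the loss 

(17) $\sum_{j=0}^{J-1} \sum_{k=1}^{2^j} (\hat\theta_{j,k} - \theta_{j,k})^2 = \sum_{j=0}^{J_1-1} \sum_{k=1}^{2^j} (\hat\theta_{j,k} - \theta_{j,k})^2 + \sum_{j=J_1}^{J-1} \sum_{k=1}^{2^j} \theta_{j,k}^2.$ 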


The first of these terms is bones Ead to that used in (7) and (8). The sec- 
ond component in the loss 7^ = LA PRO g? jk isa ea — It can be 
estimated well by using an unbiased estimate of x : P: , where J2 is the 
largest integer "^w 2^ < min(N, M*/ CETT, (H then bounding 
the tail >=), 377. 0? , from above. 

More aes ed iet the squared radius 


r2 = M21 AD 48/448) 


J1—1 
+5 (Det, -Ln (s^ < x) 
y=O0 \ı T 
(18) 1/2 mae 2 -1 
+ (Ay 8A/^—DLn ^ $5 SUIS), > A4Ln7) 
J—0 1 
hl Y 
t 3,07. =n’), 
jad, k=l 


where 


Ca = P (1 — 277)! + 2log! (=) 
E |l: lo^ (2) + Zaja : 2272412 (1 — 2-2 Ab) 
4 25/4: 20*1 (1 — pen 
x AE AEN EE, 


Note that the last term in $c_\alpha$ tends to 0 as $n \to \infty$ or $M \to \infty$. 

The following theorem shows that the confidence ball $CB^*$ defined by 

(19) $CB^* = \{\theta : \|\theta - \hat\theta\|_2 \le r_\alpha\}$ 

has adaptive radius and desired coverage probability. 


THEOREM 4. Fix $0 < \alpha < \frac12$ and let the confidence ball $CB^*$ be given as 
in (19). Then, for any $\tau \ge \beta$, 


inf | P(0 c CB*) 
0€B5 ,(M) 
CO ^ >a) 
= [n ! ae 3(1 E Doo EEE) Ml ome ee ee, 

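For $\tau \le 2\beta$, 

(21) $\sup_{\theta \in B^\tau_{p,q}(M)} E(r_\alpha^2) \le C_\tau \min\bigl(M^{2/(1+2\tau)} n^{-2\tau/(1+2\tau)},\ N n^{-1}\bigr)$ 

and for $\tau > 2\beta$, 

(22) $\sup_{\theta \in B^\tau_{p,q}(M)} E(r_\alpha^2) \le C_\beta \min\bigl(M^{2/(1+4\beta)} n^{-4\beta/(1+4\beta)},\ N n^{-1}\bigr),$ 

where $C_\tau$ and $C_\beta$ are constants depending only on $\tau$ and $\beta$, respectively. 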


Theorem 4 taken together with Theorem 3 shows that the confidence ball $CB^*$ 
is adaptive over a maximal range of Besov bodies 

(23) $\mathcal{C} = \{B^\tau_{p,q}(M) : \tau \in [\beta, 2\beta],\ p \ge 2,\ q \ge 1\}$ 

when either $N \ge n^2$ or $N = n^\rho$, $0 < \rho < 2$ and $\beta \ge \frac{1}{2\rho} - \frac14$. In addition, the results 
also show that the confidence ball $CB^*$ still has guaranteed coverage over $B^\tau_{p,q}(M)$ 
for $\tau > 2\beta$ although the maximum expected radius is necessarily inflated. 


4. Adaptive confidence balls with coverage over $\mathbb{R}^N$. In Section 3 it was 
assumed that the mean vector belongs to a Besov body $B^\beta_{p,q}(M)$ and the confidence 
ball was constructed to ensure that it had a prespecified coverage probability over 
that Besov body. Under this constraint there are two situations where the 
confidence ball has expected squared radius that adapts over the Besov bodies $B^\tau_{p,q}(M)$ 
with $\tau$ between $\beta$ and $2\beta$, namely, when $N \ge n^2$ or when $N = n^\rho$ with $0 < \rho < 2$ 
and $\beta \ge \frac{1}{2\rho} - \frac14$. In both cases this is the largest range over which adaptation is 
possible. 

We now turn to a construction of “honest” confidence balls which have 
guaranteed coverage over all of $\mathbb{R}^N$. For the case when $N = n$, such “honest” confidence 
balls, those with a guaranteed coverage probability over all of $\mathbb{R}^N$, were a topic 
pioneered in [11]. See also [2] and [3]. Li [11] was the first to show, when $N = n$, 
that any “honest” confidence ball must have a minimum expected squared radius 
of order $n^{-1/2}$. In fact, using the lower bounds in Theorem 1 for the level-by-level 
case, it is easy to see that for any confidence ball with coverage over all of $\mathbb{R}^N$ 
the random radius must in general satisfy 


1 — 2 — 26 
(24) Eo(r2) > —=—= (logit + 62). Nn, 


Once again, for the case when $N = n$, Li [11] also showed how to construct 
“honest” confidence balls with maximum expected squared radius of order $n^{-1/2}$ 
over a parameter space where a linear estimator can be constructed with maximum 
risk of order $n^{-1/2}$. Such estimators exist when the parameter space only consists 
of sufficiently smooth functions. In particular, for the Besov bodies $B^\beta_{p,q}(M)$ with 
$p \ge 2$ Donoho and Johnstone [7] showed that the minimax linear risk is of order 
$n^{-2\beta/(1+2\beta)}$ and the methodology of Li [11] then leads to “honest” confidence 
balls with maximum expected squared radius converging at a rate of $n^{-1/2}$ over 
Besov bodies $B^\beta_{p,q}(M)$ if $\beta \ge \frac12$ and $p \ge 2$. However this approach is not adaptive 
over Besov bodies $B^\beta_{p,q}(M)$ with $\beta < \frac12$. 

In this section “honest” confidence balls are constructed over $\mathbb{R}^N$ which 
simultaneously adapt over a maximal range of Besov bodies. Attention is focused on the 
case where $N < n^2$ since, from (24), if $N \ge n^2$, the minimum expected squared 
radius of such “honest” confidence balls does not even converge to zero. 

The confidence ball is built by applying the single level construction given in Section 2 level by level. In particular, the center of the confidence ball is obtained by block thresholding all the observations in blocks of size $L = \log n$. For each index $(j,k)$ in the block, say, $B^j_i$, the estimate of $\theta_{j,k}$ is given by

(25)  $\hat\theta_{j,k} = y_{j,k}\, I(S^2_{j,i} > \lambda_* L n^{-1}),$

where $\lambda_* = 6.9368$. The center of the confidence ball $\hat\theta$ is then defined by $\hat\theta = (\hat\theta_{j,k})$. The construction of the radius is once again based on an analysis of the loss $\|\hat\theta - \theta\|_2^2$ and applies the same technique as that given in Section 2. Set


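(26)  $r_\alpha^2 = [2\log^{1/2}(2/\alpha) + 4\lambda_*^{1/2} z_{\alpha/4}]\, N^{1/2} n^{-1}$
      $\quad + \sum_{j=0}^{J-1}\sum_i (S^2_{j,i} - L n^{-1})_+\, I(S^2_{j,i} \le \lambda_* L n^{-1})$
      $\quad + (2\lambda_* + 8\lambda_*^{1/2} - 1)\, L n^{-1}\, \mathrm{Card}\{(j,i) : S^2_{j,i} > \lambda_* L n^{-1}\}.$

With $\hat\theta$ given in (25) and $r_\alpha$ given in (26) the confidence ball is then defined by

(27)  $CB_\alpha(\hat\theta, r_\alpha) = \{\theta : \|\theta - \hat\theta\|_2 \le r_\alpha\}.$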
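The construction in (25)-(27) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `honest_ball` and `simulate` are hypothetical, the block length is taken as $L$ rounded to an integer, and blocks at coarse levels that contain fewer than $L$ coefficients are still charged the full $L n^{-1}$ terms, which is a simplification made here.

```python
import math
import random
from statistics import NormalDist

LAMBDA_STAR = 6.9368  # constant with lambda - log(lambda) = 5


def honest_ball(y, n, alpha=0.05):
    """Center and squared radius in the spirit of (25)-(27).

    y is a list of levels; y[j] holds the observations y_{j,k} of level j.
    Assumption of this sketch: blocks have length round(log n), and the last
    block of a level may be shorter.
    """
    L = max(1, round(math.log(n)))
    N = sum(len(level) for level in y)
    thresh = LAMBDA_STAR * L / n
    z = NormalDist().inv_cdf(1 - alpha / 4)  # z_{alpha/4}: P(Z < -z) = alpha/4
    penalty = (2 * LAMBDA_STAR + 8 * math.sqrt(LAMBDA_STAR) - 1) * L / n
    r2 = (2 * math.sqrt(math.log(2 / alpha))
          + 4 * math.sqrt(LAMBDA_STAR) * z) * math.sqrt(N) / n
    center = []
    for level in y:
        est = []
        for start in range(0, len(level), L):
            block = level[start:start + L]
            s2 = sum(v * v for v in block)
            if s2 > thresh:
                est.extend(block)               # keep the block, as in (25)
                r2 += penalty                   # third term of (26)
            else:
                est.extend([0.0] * len(block))  # set the block to zero
                r2 += max(s2 - L / n, 0.0)      # second term of (26)
        center.append(est)
    return center, r2


def simulate(theta, n, seed=0):
    """Draw y_{j,k} = theta_{j,k} + n^{-1/2} z_{j,k} with z i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    return [[t + rng.gauss(0.0, 1.0) / math.sqrt(n) for t in level]
            for level in theta]
```

Calling `honest_ball(simulate(theta, n), n)` returns the thresholded center together with $r_\alpha^2$; the ball covers $\theta$ whenever the squared distance between the center and $\theta$ does not exceed that value.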


ADAPTIVE CONFIDENCE BALLS 213 
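THEOREM 5. Let the confidence ball $CB_\alpha(\hat\theta, r_\alpha)$ be given as in (27). Then

(28)  $\inf_{\theta \in \mathbb{R}^N} P(\theta \in CB_\alpha(\hat\theta, r_\alpha)) \ge 1 - \alpha - 2(\log n)^{-1}$

and, if $M \ge 1$,

(29)  $\sup_{\theta \in B^{\tau}_{p,q}(M)} E(r_\alpha^2) \le [2\log^{1/2}(2/\alpha) + 4\lambda_*^{1/2} z_{\alpha/4} + 4]\, N^{1/2} n^{-1} + C_\tau \min(N n^{-1}, M^{2/(1+2\tau)} n^{-2\tau/(1+2\tau)}),$

where $C_\tau > 0$ is a constant depending only on $\tau$.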


It is also interesting to understand Theorem 5 from an asymptotic point of view. Fix $0 < \rho < 2$ and let $N = n^{\rho}$. It then follows from Theorem 5 that the confidence ball constructed above has adaptive squared radius over Besov bodies $B^{\tau}_{p,q}(M)$ with $\tau \le \frac{1}{\rho} - \frac{1}{2}$ and has maximum expected squared radius of order $n^{\rho/2 - 1}$ over Besov bodies with $\tau \ge \frac{1}{\rho} - \frac{1}{2}$. Note that the range depends on $N$. In particular, consider the special case of $N = n$. In this case, note that for $\tau \le \frac{1}{2}$ and $M \ge 1$ it follows that

(30)  $\sup_{\theta \in B^{\tau}_{p,q}(M)} E(r_\alpha^2) \le C_\tau \min(1, M^{2/(1+2\tau)} n^{-2\tau/(1+2\tau)})$

and hence, although the confidence ball $CB_\alpha$ depends only on $n$ and the confidence level, it adapts over the collection of all Besov bodies $B^{\beta}_{p,q}(M)$ with $\beta \le \frac{1}{2}$,

(31)  $\mathcal{C} = \{B^{\beta}_{p,q}(M) : 0 < \beta \le \tfrac{1}{2},\ p \ge 2,\ q \ge 1,\ M \ge 1\}.$

This is the maximal range of Besov bodies over which honest confidence balls can adapt. In addition, it follows from (29) that the confidence ball has maximum expected squared radius within a constant factor of the smallest maximum expected squared radius for all Besov bodies $B^{\beta}_{p,q}(M)$ with $\beta > 0$ and $M \ge 1$ among all confidence balls which have a prespecified coverage probability over $\mathbb{R}^N$.
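The asymptotic reading of the bound in Theorem 5 can be made concrete with a small calculation. The sketch below assumes $N = n^{\rho}$ with $M$ held fixed and simply compares the exponents of $n$ in the two terms of the bound, $N^{1/2} n^{-1}$ and $\min(N n^{-1}, n^{-2\tau/(1+2\tau)})$; the function names are hypothetical and constants are ignored.

```python
def squared_radius_exponent(tau, rho):
    """Exponent e such that the bound on sup E(r^2) is of order n**e.

    Assumes N = n**rho and a fixed M; constants are ignored.
    """
    honest_term = rho / 2.0 - 1.0  # N^{1/2} n^{-1}
    adaptive_term = min(rho - 1.0, -2.0 * tau / (1.0 + 2.0 * tau))
    return max(honest_term, adaptive_term)


def adaptation_threshold(rho):
    """Smoothness 1/rho - 1/2 beyond which the N^{1/2} n^{-1} term dominates."""
    return 1.0 / rho - 0.5
```

For $\rho = 1$ the threshold is $1/2$: below it the exponent is $-2\tau/(1+2\tau)$, and above it the exponent stays at $-1/2$.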


5. Proofs. In this section proofs of the main theorems are given except for Theorem 2. The proof of Theorem 2 is analogous to, although slightly easier than, that given for Theorem 4.


5.1. Proof of Theorems 1 and 3. Theorems 1 and 3 give lower bounds for the 
squared radius of the confidence balls. A unified proof of these two theorems can 
be given. We begin with a lemma on the minimax risk over a hypercube. 




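LEMMA 1. Suppose $y_i = \theta_i + \sigma z_i$, $z_i \stackrel{\mathrm{i.i.d.}}{\sim} N(0,1)$ and $i = 1, \ldots, m$. Let $a > 0$, and set $C_m(a) = \{\theta \in \mathbb{R}^m : \theta_i = \pm a,\ i = 1, \ldots, m\}$. Let the loss function be

(32)  $L(\hat\theta, \theta) = \sum_{i=1}^m I(|\hat\theta_i - \theta_i| \ge a).$

Then the minimax risk over $C_m(a)$ satisfies

(33)  $\inf_{\hat\theta} \sup_{\theta \in C_m(a)} E(L(\hat\theta, \theta)) = \inf_{\hat\theta} \sup_{\theta \in C_m(a)} \sum_{i=1}^m P(|\hat\theta_i - \theta_i| \ge a) = \Phi(-a/\sigma)\, m,$

where $\Phi(\cdot)$ is the cumulative distribution function for the standard Normal distribution.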


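PROOF. Let $\pi_i$, $i = 1, \ldots, m$, be independent with $\pi_i(a) = \pi_i(-a) = \frac{1}{2}$. Let $\pi = \prod_{i=1}^m \pi_i$ be the product prior on $\theta \in C_m(a)$. The posterior distribution of $\theta$ given $y$ can be easily calculated as $P_{\theta|y}(\theta) = \prod_{i=1}^m P_{\theta_i|y_i}(\theta_i)$ where

$P_{\theta_i|y_i}(\theta_i) = \frac{e^{2ay_i/\sigma^2}}{1 + e^{2ay_i/\sigma^2}}\, I(\theta_i = a) + \frac{1}{1 + e^{2ay_i/\sigma^2}}\, I(\theta_i = -a).$

The Bayes estimator $\hat\theta^{\pi}$ under the prior $\pi$ and loss $L(\cdot,\cdot)$ given in (32) is then the minimizer of $E_{\theta|y} L(\hat\theta, \theta) = \sum_{i=1}^m P_{\theta|y}(|\hat\theta_i - \theta_i| \ge a)$. A solution is then given by the simple rule $\hat\theta^{\pi}_i = a$ if $y_i \ge 0$, $\hat\theta^{\pi}_i = -a$ if $y_i < 0$. The risk of the Bayes rule $\hat\theta^{\pi}$ equals

(34)  $\sum_{i=1}^m P_\theta(|\hat\theta^{\pi}_i - \theta_i| \ge a) = \sum_{i=1}^m \{P(y_i < 0 \mid \theta_i = a)\, I(\theta_i = a) + P(y_i \ge 0 \mid \theta_i = -a)\, I(\theta_i = -a)\} = \Phi(-a/\sigma)\, m.$

Since the risk of the Bayes rule $\hat\theta^{\pi}$ is a constant, it equals the minimax risk. □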
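As a numerical sanity check of Lemma 1, the risk of the sign rule can be simulated and compared with $\Phi(-a/\sigma)\, m$. The sketch below is illustrative only; the function names and the choice of evaluating the risk at $\theta = (a, \ldots, a)$ are assumptions made here (by symmetry the risk is the same at every vertex of $C_m(a)$).

```python
import random
from statistics import NormalDist


def minimax_risk(a, sigma, m):
    """The value m * Phi(-a / sigma) appearing in (33)."""
    return m * NormalDist().cdf(-a / sigma)


def simulated_sign_rule_risk(a, sigma, m, reps=20000, seed=0):
    """Monte Carlo risk of the rule 'a if y_i >= 0 else -a' at theta = (a, ..., a)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(reps):
        for _ in range(m):
            y = a + sigma * rng.gauss(0.0, 1.0)
            if y < 0:  # rule picks -a, so |theta_hat_i - theta_i| = 2a >= a
                errors += 1
    return errors / reps
```

With $a = \sigma = 1$ and $m = 5$ the simulated risk should be close to $5\,\Phi(-1)$, up to Monte Carlo error.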


The proofs of Theorems 1 and 3 are also based on a bound on the $L_1$ distance between a multivariate normal distribution with mean 0 and a mixture of normal distributions with means supported on the union of vertices of a collection of hyperrectangles. Let $C(a,k)$ be the set of $N$-dimensional vectors of which the first $k$ coordinates are equal to $a$ or $-a$ and the remaining coordinates are equal to 0. Then $\mathrm{Card}(C(a,k)) = 2^k$. Let $P_k$ be the mixture of Normal distributions with mean supported over $C(a,k)$,

(35)  $P_k = \frac{1}{2^k} \sum_{\theta \in C(a,k)} \Phi_{\theta, 1/\sqrt{n}, N},$

where $\Phi_{\theta,\sigma,N}$ is the Normal distribution $N(\theta, \sigma^2 I_N)$. Denote by $\phi_{\theta,\sigma,N}$ the density of $\Phi_{\theta,\sigma,N}$ and set $P_0 = \Phi_{0, 1/\sqrt{n}, N}$.


LEMMA 2. Fix $0 < \varepsilon < 1$ and suppose $k a^4 n^2 \le \log(1 + \varepsilon^2)$. Then

(36)  $L_1(P_0, P_k) \le \varepsilon.$

In particular, if $A$ is any event such that $P_0(A) \ge \alpha$, then

(37)  $P_k(A) \ge \alpha - \varepsilon,$

where $P_k$ is the mixture of Normal distributions given in (35).


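PROOF. The chi-squared distance between the distributions $P_k$ and $P_0 = \Phi_{0, 1/\sqrt{n}, N}$ satisfies $\int P_k^2 / P_0 \le e^{k a^4 n^2} \le 1 + \varepsilon^2$ and consequently the $L_1$ distance between $P_0$ and $P_k$ satisfies

$L_1(P_0, P_k) = \int |P_0 - P_k| \le \Big(\int \frac{P_k^2}{P_0} - 1\Big)^{1/2} \le \varepsilon.$

Hence, if $P_0(A) \ge \alpha$, then $P_k(A) \ge P_0(A) - L_1(P_0, P_k) \ge \alpha - \varepsilon$ and the lemma follows. □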
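The bound in Lemma 2 is easy to check numerically. A short calculation that is not spelled out in the text gives $\int P_k^2 / P_0 = \cosh(a^2 n)^k$ for the mixture (35), since each of the $k$ active coordinates contributes a factor $\cosh(a^2/\sigma^2)$ with $\sigma^2 = 1/n$. The sketch below uses that identity; the function names are hypothetical.

```python
import math


def chi_square_integral(a, k, n):
    """Value of the integral of P_k^2 / P_0, which equals cosh(a^2 n)^k."""
    return math.cosh(a * a * n) ** k


def l1_bound(a, k, n):
    """Cauchy-Schwarz bound (integral of P_k^2 / P_0 - 1)^{1/2} on L1(P_0, P_k)."""
    return math.sqrt(chi_square_integral(a, k, n) - 1.0)


def largest_a(eps, k, n):
    """Largest a allowed by the condition k a^4 n^2 <= log(1 + eps^2)."""
    return (math.log(1.0 + eps * eps) / (k * n * n)) ** 0.25
```

Choosing `a = largest_a(eps, k, n)` makes the hypothesis of the lemma hold with equality, and `l1_bound(a, k, n)` then stays below `eps`.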


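PROOF OF THEOREMS 1 AND 3. We first prove the bound (3). Fix a constant $\varepsilon$ satisfying $0 < \varepsilon < \frac{1}{2}(\frac{1}{2} - \alpha)$ and note that $z_{\alpha+2\varepsilon} > 0$. Take $m = 2^j$, $\sigma = n^{-1/2}$ and $a = \min(z_{\alpha+2\varepsilon}\, n^{-1/2}, M 2^{-j(\beta+1/2)})$ in Lemma 1 and let $C_m(a)$ be defined as in Lemma 1. Then every $N$-dimensional vector with the $j$th level coordinates $\theta_j$ in $C_m(a)$ and other coordinates equal to zero is contained in $B^{\beta}_{p,q}(M)$. It then follows from Lemma 1 that

(38)  $\inf_{\hat\theta} \sup_{\theta \in B^{\beta}_{p,q}(M)} \sum_{k=1}^m P(|\hat\theta_{j,k} - \theta_{j,k}| \ge a) \ge \inf_{\hat\theta} \sup_{\theta_j \in C_m(a)} \sum_{k=1}^m P(|\hat\theta_{j,k} - \theta_{j,k}| \ge a) \ge (\alpha + 2\varepsilon)\, m.$

For any $\hat\theta$, set $X_\theta = \sum_{k=1}^m I(|\hat\theta_{j,k} - \theta_{j,k}| \ge a)$. Then $X_\theta \le m$. Let $\gamma = \frac{\varepsilon}{1 - \alpha - \varepsilon}$. Then

$(\alpha + 2\varepsilon)\, m \le \sup_{\theta \in B^{\beta}_{p,q}(M)} E(X_\theta) \le \sup_{\theta \in B^{\beta}_{p,q}(M)} \{\gamma m P(X_\theta < \gamma m) + m P(X_\theta \ge \gamma m)\}.$

It follows that $\sup_{\theta \in B^{\beta}_{p,q}(M)} P(X_\theta \ge \gamma m) \ge \alpha + \varepsilon$ and consequently

(39)  $\sup_{\theta \in B^{\beta}_{p,q}(M)} P(\|\hat\theta_j - \theta_j\|_2^2 \ge \gamma m a^2) \ge \sup_{\theta \in B^{\beta}_{p,q}(M)} P(X_\theta \ge \gamma m) \ge \alpha + \varepsilon.$

Suppose $CB(\hat\theta, r_\alpha) = \{\theta_j : \|\theta_j - \hat\theta_j\|_2 \le r_\alpha\}$ is a $1 - \alpha$ level confidence ball over $B^{\beta}_{p,q}(M)$. Then $\inf_{\theta \in B^{\beta}_{p,q}(M)} P(\|\hat\theta_j - \theta_j\|_2^2 \le r_\alpha^2) \ge 1 - \alpha$ and hence

$\sup_{\theta \in B^{\beta}_{p,q}(M)} P(r_\alpha^2 \ge \gamma m a^2) \ge \sup_{\theta \in B^{\beta}_{p,q}(M)} P(\gamma m a^2 \le \|\hat\theta_j - \theta_j\|_2^2 \le r_\alpha^2) \ge \alpha + \varepsilon + 1 - \alpha - 1 = \varepsilon.$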
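Thus for any $\varepsilon$ satisfying $0 < \varepsilon < \frac{1}{2}(\frac{1}{2} - \alpha)$, $\sup_{\theta \in B^{\beta}_{p,q}(M)} E(r_\alpha^2) \ge \varepsilon \gamma m a^2$, which completes the proof of (3). The proof of (12) is quite similar. Let $j'$ be the largest integer satisfying $2^{j'} \le \min(N, (1 - 2^{-q(\beta+1/2)})^{2/(q(1+2\beta))} z_{\alpha+2\varepsilon}^{-2/(1+2\beta)} M^{2/(1+2\beta)} n^{1/(1+2\beta)})$. Equation (12) in Theorem 3 follows from Lemma 1 by taking $m = 2^{j'}$, $\sigma = n^{-1/2}$ and $a = z_{\alpha+2\varepsilon}\, n^{-1/2}$.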
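We now turn to the proof of (4) and (15). For (4) apply Lemma 2 with $k = 2^j$ and $a = \min(M 2^{-j(\beta+1/2)}, (\log(1+\varepsilon^2))^{1/4}\, 2^{-j/4} n^{-1/2})$. It is easy to check by using the first term in the minimum that $2^{j(\beta+1/2)} a \le M$. Hence the sequence which is equal to $a$ or $-a$ on the $j$th level and otherwise zero satisfies the Besov constraint (2). Moreover, using the second term in the minimum, it is clear that $k a^4 n^2 \le \log(1 + \varepsilon^2)$. For (15) the above remarks hold with $j$ replaced by $J$ and it is clear that the collection $C(a,k)$ of all such sequences is contained in $B^{\beta}_{p,q}(M)$. It then follows from Lemma 2 that, for $P_k$ defined by (35), $L_1(P_0, P_k) \le \varepsilon$ and so

(40)  $P_k(0 \in CB(\hat\theta, r_\alpha)) \ge 1 - \alpha - \varepsilon.$

Now since for all $\theta \in C(a,k)$, $P_\theta(\theta \in CB(\hat\theta, r_\alpha)) \ge 1 - \alpha$ and hence $P_\theta(\{C(a,k) \cap CB(\hat\theta, r_\alpha) \ne \emptyset\}) \ge 1 - \alpha$, it follows that

(41)  $P_k(\{C(a,k) \cap CB(\hat\theta, r_\alpha) \ne \emptyset\}) \ge 1 - \alpha.$

The Bonferroni inequality applied to equations (40) and (41) then yields

(42)  $P_k(\{0 \in CB(\hat\theta, r_\alpha)\} \cap \{C(a,k) \cap CB(\hat\theta, r_\alpha) \ne \emptyset\}) \ge 1 - 2\alpha - \varepsilon.$

Once again, since $L_1(P_0, P_k) \le \varepsilon$ it follows that

(43)  $P_0(\{0 \in CB(\hat\theta, r_\alpha)\} \cap \{C(a,k) \cap CB(\hat\theta, r_\alpha) \ne \emptyset\}) \ge 1 - 2\alpha - 2\varepsilon.$

Now note that for all $\theta \in C(a,k)$, $\|\theta\|_2 = a k^{1/2} = 2 b_\varepsilon$. Hence, if $CB(\hat\theta, r_\alpha)$ contains both 0 and some point $\theta \in C(a,k)$, it follows that the radius $r_\alpha \ge \frac{1}{2}\|\theta\|_2 = b_\varepsilon$ and consequently

$P_0(r_\alpha \ge b_\varepsilon) \ge P_0(\{0 \in CB(\hat\theta, r_\alpha)\} \cap \{C(a,k) \cap CB(\hat\theta, r_\alpha) \ne \emptyset\}) \ge 1 - 2\alpha - 2\varepsilon.$ □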




5.2. Proof of Theorem 4. The proof of Theorem 4 is involved. We first collect in the following lemmas some preparatory results on the tails of chi-squared distributions and Besov bodies. The proofs of these lemmas are straightforward and are thus omitted here. See [6] for detailed proofs.


LEMMA 3. Let $X_m$ be a random variable having a central chi-squared distribution with $m$ degrees of freedom. If $d > 0$, then

(44)  $P(X_m \ge (1 + d)m) \le \frac{1}{2} e^{-(m/2)(d - \log(1+d))}$

and consequently $P(X_m \ge (1+d)m) \le \frac{1}{2} e^{-(1/4)d^2 m + (1/6)d^3 m}$. If $0 < d < 1$, then

(45)  $P(X_m \le (1 - d)m) \le e^{-(1/4)d^2 m}.$


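LEMMA 4. Let $y_i = \theta_i + \sigma z_i$, $i = 1, 2, \ldots, L$, $z_i \stackrel{\mathrm{i.i.d.}}{\sim} N(0,1)$ and let $\lambda_* = 6.9368$ be the constant satisfying $\lambda - \log\lambda = 5$.

(i) For $\tau > 0$ let $\lambda_\tau > 1$ denote the constant satisfying $\lambda - \log\lambda = 1 + \frac{4\tau}{1+2\tau}$. If $\sum_{i=1}^L \theta_i^2 \le (\sqrt{\lambda_*} - \sqrt{\lambda_\tau})^2 L\sigma^2$, then

(46)  $P\Big(\sum_{i=1}^L y_i^2 \ge \lambda_* L\sigma^2\Big) \le P\Big(\sum_{i=1}^L z_i^2 \ge \lambda_\tau L\Big) \le \frac{1}{2} e^{-(2\tau/(1+2\tau))L}.$

(ii) If $\sum_{i=1}^L \theta_i^2 \ge 4\lambda_* L\sigma^2$, then

(47)  $P\Big(\sum_{i=1}^L y_i^2 \le \lambda_* L\sigma^2\Big) \le P\Big(\sum_{i=1}^L z_i^2 \ge \lambda_* L\Big) \le \frac{1}{2} e^{-2L}.$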
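The constants in Lemma 4 are roots of $\lambda - \log\lambda = c$ on $\lambda > 1$ and can be computed by bisection. The sketch below is illustrative; the helper name `solve_lambda` is hypothetical, and the formula used for $\lambda_\tau$ is the defining equation $\lambda - \log\lambda = 1 + 4\tau/(1+2\tau)$ read from part (i) of the lemma.

```python
import math


def solve_lambda(c, lo=1.0, hi=100.0, tol=1e-12):
    """Root lambda > 1 of lambda - log(lambda) = c, for 1 < c < 95, by bisection.

    The map lambda -> lambda - log(lambda) is increasing on lambda > 1,
    so the root in [lo, hi] is unique.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - math.log(mid) - c < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


lambda_star = solve_lambda(5.0)  # threshold constant of (25)


def lambda_tau(tau):
    """Constant of Lemma 4(i) for smoothness tau > 0."""
    return solve_lambda(1.0 + 4.0 * tau / (1.0 + 2.0 * tau))
```

Since $1 + 4\tau/(1+2\tau) < 3 < 5$ for every $\tau > 0$, one always has $1 < \lambda_\tau < \lambda_*$, so the quantity $\sqrt{\lambda_*} - \sqrt{\lambda_\tau}$ in the lemma is positive.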
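LEMMA 5. (i) For any $\theta \in B^{\tau}_{p,q}(M)$ with $p \ge 2$ and any $0 \le m \le J - 1$,

(48)  $\sum_{j=m}^{J-1} \sum_{k=1}^{2^j} \theta_{j,k}^2 \le (1 - 2^{-2\tau})^{-1} M^2 2^{-2\tau m}.$

(ii) For a constant $a > 0$, set $\mathcal{L} = \{(j,i) : \sum_{(j,k) \in B^j_i} \theta_{j,k}^2 > a L n^{-1}\}$. Then for $p \ge 2$,

(49)  $\sup_{\theta \in B^{\tau}_{p,q}(M)} \mathrm{Card}(\mathcal{L}) \le D L^{-1} M^{2/(1+2\tau)} n^{1/(1+2\tau)},$

where $D$ is a constant depending only on $a$ and $\tau$. In particular, $D$ can be taken as $D = 3(1 - 2^{-2\tau})^{-1/(1+2\tau)} a^{-1/(1+2\tau)}$.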




PROOF OF THEOREM 4. The proof is naturally divided into two parts: expected squared radius and the coverage probability. First recall the notation that for a given block $B^j_i$,

$S^2_{j,i} = \sum_{(j,k) \in B^j_i} y_{j,k}^2, \quad \xi^2_{j,i} = \sum_{(j,k) \in B^j_i} \theta_{j,k}^2 \quad\text{and}\quad \chi^2_{j,i} = \sum_{(j,k) \in B^j_i} z_{j,k}^2.$


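We begin with the expected squared radius. Let $\tau \ge \beta$ and suppose $\theta \in B^{\tau}_{p,q}(M)$. From (18) we have

(50)  $E_\theta(r_\alpha^2) = c_\alpha M^{2/(1+4\beta)} n^{-4\beta/(1+4\beta)}$
      $\quad + \sum_{j=0}^{J_1-1}\sum_i E_\theta\big((S^2_{j,i} - L n^{-1})_+\, I(S^2_{j,i} \le \lambda_* L n^{-1})\big)$
      $\quad + (2\lambda_* + 8\lambda_*^{1/2} - 1) L n^{-1} \sum_{j=0}^{J_1-1}\sum_i P_\theta(S^2_{j,i} > \lambda_* L n^{-1})$
      $\quad + \sum_{j=J_1}^{J_2-1}\sum_{k=1}^{2^j} \theta_{j,k}^2$
      $= G_1 + G_2 + G_3 + G_4.$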


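We begin with the term $G_3$. Let $\lambda_\tau$ be defined as in Lemma 4 and set

(51)  $\mathcal{L}_1 = \{(j,i) : j \le J_1 - 1,\ \xi^2_{j,i} > (\sqrt{\lambda_*} - \sqrt{\lambda_\tau})^2 L n^{-1}\}$

and

(52)  $\mathcal{L}_2 = \{(j,i) : j \le J_1 - 1,\ \xi^2_{j,i} \le (\sqrt{\lambda_*} - \sqrt{\lambda_\tau})^2 L n^{-1}\}.$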
It then follows from Lemmas 4 and 5 that 

J1i—1 

>> S. PG^ An )m Y P(S5 > ALn’) 

J=0 i (eti 

+ J PG? -A.Ln |) 
(Ed 


< Card(41) + lp 12^ f 5g 7/042) 


< min(L 12^, DLT? M2/0-2041/0-20) 
4 E n A020 




for some constant D depending only on t. Note that 27! = min(N, M2/U+2A) x 
nV 285) and so 


JU 


G3 = QA, + 8117 — 1)Ln7! V^ Y Po(S?, > Ay Ln) 
y=O0 1 
(53) « C min(Nn^!, M?/0-20)4—2B/0-2B) M2/(1+22) a 


$ C min(Nn™!, M?/O +22) ,BIO2B)) . n ^02) 
« C min(Nn^!, MSD ys ERA) 


The term G4 is easy to bound. When N < M7/(U+2A),1/U+28) Ji = Jo and hence 
G4 — 0. When N > M?/(-*28) 41-28). it follows from (48) in Lemma 5 that 


Jz—1 2/ 


Ga= 5 3.8) 


j=J, k=1 
< (1 n" DANOM A ee 


<C min(Nn ^! M Hber coe ope. 


(54) 


We now turn to G2. Let J, be the largest integer satisfying 2^* < min(N, 
M?/Q-21) 4 V(O-21)). Write 


Je-1 


G,- Y. E(t, -Ln (s < Mah) 
j=0 I "t 


Ji-1 
m ess? — Ln DIS, 2) 
s 


]7Jt i 
= 65 + G, 


where G22 = 0 when J} = Jj. Note that 


Jel 
Gu = >> (0%, —-Ln (S5, < stn) 
j=0 i + 
Je—1 
(55) < X Y'0.- DLn 
J=0 1 


< (àx — Din t2" LT! 


< (X) min(Nn™}, MA N: 




When N < M?/(-*2041/(-22) f. J, and so Ga = 0. On the other hand, when 
Je < Ji, 


Ji—-1 
G2, — Y E(t, -Ln (S; < 2 


Teak i T 


Ji—-1 
<5 Vip tx) 
mf 


j=J; t 


a 1 2, 1/2 
Jp i 
J1i-1 


2411/2 
<>} ‘Sheree 4+-2n722) + (xs) | 


jody 


JA-1 1/2 Ji~] 
Sin y (xe) F2 bx "DOR r 
i 


jody j=J; J= szjy H 
Note that >, £2; = 572.407, < M?2-?". It then follows that 


Go < rti (1 M Qty tyre) a Gas) 
+ 4M V OFP) p AHB) (2448) 
E 2?t (1 = Quy erate) 


and so 
(56) Go < C min(Nn^!, M?/ 20), 2r A20), 
This together with (50) and (53)—(55) yields 


sup Eor < sup (Gi Goi-- Gz2 - Ga - G4) 
9EB? (M) 8c Bt (M) 


< cy min(Nn7!, M45, 748/0-45) 
+ C, min(Nn- |, MAI (1422) q~24/(1422)) 


where C, is a constant depending only on t. For 0 < t < f similar arguments 
yield 


sup Eg (r2) <C min(Nn™!, ME ice Are 
JEBI (M) 




We now turn to the coverage probability. Set C(@) = P(là — ol > r2) and fix 
t > B. We want to bound SUP 9 EBs (M) C(@). Note that 


J1-1 
là —6l5— $ J 57,I(57; x ALn!) 
j=0 1 
JA1-1 J-i 2! 
n 3 Y x? I3, ALn D) + 35 50%. 
J=0 1 jas, Km] 


It follows from (48) in Lemma 5 that


Jel 2 
2 —At4—1 ag2a—2t J 
sup 5 > 06,0-2^7) M2? 
(57) 0€ Bp (M) j=h k=l á 


< 22B(1 — 272P y2/ 0B) ,—B/EAB) 
Set ao = 261 — 27?8)-1, ay = zu - 2924" (q — 2 328)/05 x 
MI/QOX28)2/0-48),1/Q4B)-1/0-HMB) a, = 2log! (4) M1/0+28)—-2/1+48) x 
n CTAR EAE), uiuo ac DPF o DR Me A ARA ee 


n!/C+48)-1/0+48), a, = 21og!/? ($) and as = 2A» + 834^ — 1. Then Ca in (18) 


equals ao + a; d- a2 +a3 4- a4 and the squared radius r2 given in (18) can be written 
as 


r2 = (ag + ai + a2 + az + a4) M2/ 1448) y 74B/O-HB) 


Ji~} 
+5 (Dot, -Ln (S^ x eant) 
170 T 


i 


Ji-1 Jg—1 X 
+asLn™ Y VIG?» AEn) + 35 Y on’). 
j=0 1 gesda-kes] 


Set 43 = (UG, i): j < Jı — 1,87, Z 4A4Ln ^) and 44 = {G i): j S Ji - LE], < 
4X. Ln^ |). It then follows that 


C(8) < P| YER L087; S ALn n71x? ICS; > ALD] 


(,1)€43 
> Y [(7; — Ln I(S7; < A«Ln ) 
(U.)€t5 


tasLn (S7, > Macy 




+ P| SO EFAS), SALn D) n x77; A Ln 7))] 
Gel, 


+ So IG LDS; < ALn’) 
(nies 


2n asLn !I(S5. > any 


h- VY 
" P( Y^ 3262, > (a3 + a) MOD, 48/1440) 
jJ, k=l 


h-i 27 

Ex 2-004 =) 
J=J k=l 

=h) +h + 73. 


We shall consider the three terms separately. We first calculate the term $T_1$. Note that


T, < P| SO (S^; - Ez, - Ln )I(S*, <A-Ln™") < | 
G,Dets 


+P] 3, nS, > dln) > Y asa Gd, Lat) 
(J,)e43 (j,1)e15 


< Y P(Sj «An e M) P(x, > asL). 
Gets (J,i)e€43 


It follows from Lemma 4i) that P(S?, < A.Ln ^!) < P(x}, > AsL) < 4n? for 
(j,i) € £3. Lemma 5 now yields 


T, < n ? . Card(43) 
(58) < 3(1 = g-2ry Tree) (AA / 0-121) PIMAT) p 2/22) 
« 3(1 — 2725y-1/0-20) p -1 yg2/ 0428) 428/0428). 
We now turn to the second term $T_2$. Note that
T,—P| Y. [(S2; — £7; — Ln) c asLn ^ I(S?, > A4Ln^)] 


(,1)€44 
< — (01 + a) M2/1+48) 7451045) 




+ > GLenjxh-8 -Ln G Macy 
Ged, 


< P| 32 (87, — E; — Ln) < —(ai a2) M^ b Rachael 
(7 tel, 


i P| Y, (S82, en 1x2, — 82, — Ln 152, > A4Ln |) 
(J,1)e44 


> > asLn (S7, > ant) 
(Q,))€ef4 
= 151 + 729. 


For any given block, write 


js = > (0) k n7 "ez, y 
(,K)e B/ 


= EF, tam *^ Op kek tO x, 
(, k)e B/ 


= FF Wn PE Zi XG 


where Z,, — E71 E (ep! 912), is a standard Normal variable. Then 


I5 = P| > (S^; — E^ — Ln |) < —(dj Fa) MO HD 440 | 
GEL, 


< P| > Ol Y ny ae tax? m in’) 
(J,i)eta 


< este + cay? RM] 


< pax” » Eji Zji <—ay mia rasan] 
(,i)efa4 


+ P| YO xi < a M*AUTSDSVOB) 4 Cai). 
Gets 
= hn + Thi. 




Note that, for any 0 < j' < J; — 1, 


J1—1 
3); S; sL M Ahn + Y MA 
nel, J=} 


—41,2/ n+ M^?(1 TETIT, 


Minimizing the right-hand side yields that Dy, yeg, £7, < 2(42,)?/0 t?) (1 — 
221 )-1/0-£2:) 2/020 421/21) = 8.0 = 2728) 1/428) 42/1428) x 
n—*b/U+2B) Denote by Z a standard Normal random variable. It then follows 
that 7514 = P(Z < — 3a M?/( Dg ATIS eg, $50 Us PZ« 
Za/4) = $. Now consider the term 7212. If Card(44)L < a? M? OMP) g1ATTAB), 
then T212 = 0. Now suppose Card(44)L > a; M?/(0--4B),1/(-748). [t follows 
from (45) in Lemma 3 by taking m = Card(44)L < 27! and d = a3 M"/ U^) x 
nV 05 fm that Tj; < exp(— 1a2 MA/(-45)2/0-25,2/048)-1/0-428)) — & 
and hence 


Of 
(59) Ta = hu + T2912 < 2' 


We now consider the term $T_{22}$. Simple algebra yields that


T5 P( Y, (Stn yg, E, — Ln )I(S2, > Agen") 
Get, 


> >. asLn !I(S*, > 22) 


(,i)eda4 
< JO P(Žji > 1. (as 24 + Ln”) 
Get, 
+ $. P(x% >L). 
(1) €44 


Note that $\xi^2_{j,i} \le 4\lambda_* L n^{-1}$ for $(j,i) \in \mathcal{L}_4$. Hence it follows from the bounds on the tail probability of standard Normal and central chi-squared distributions that


T? < > P(Z,, > 2(log n)!/*) + 5 In^ 
(60) (,i)e4 (G Dela 
< LAM O28, 28/0428) 4-1. 


We now turn to the third term 73. Note that Yk = 0 + 2n- 76, kZj k + n 




and so 
J2—1 2! 
n- P(2 c 3 Y 6j. 
jzJi k=l 
J4—-1 X 
e Y d. m 
jel kml 
Jo—-1 2! 
« pant Y 3 Okai < ajos ues) 
JJ, k=l 
Jn~1 27 
$ P( 5y 5 «25.94. T 
jody k=l 
= Drt 240. 
Set y^ = 2c 31-4 0 i and Z = y7! xm S 0, kzjk. Then Z is a stan- 


dard Normal variable and it follows from (48) in Lemma 5 that y? < 228(] — 
228-1 2/0428), 28/0-28). Hence, 


Tsy x P(Z < -278-M1 — 2728123, p2/0-48)- 1/1428) 


(61) x nV (48) - 1/048) 


a 
= P(Z < —zaj4) = " 
It follows from Lemma 3 with m = 2/2 — 2^ and d = a4M?/ (450g 1/0448) / m 


that 732 < e71/944 — &. Equation (20) now follows from this together with (58), 
(59), (60) and (61). O 


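5.3. Proof of Theorem 5. The proof of Theorem 5 is similar to that of Theorem 4. We shall omit some details and only give a brief proof here. Suppose $\theta \in B^{\tau}_{p,q}(M)$. Set $b_1 = 2\log^{1/2}(2/\alpha)$, $b_2 = 4\lambda_*^{1/2} z_{\alpha/4}$ and $b_3 = 2\lambda_* + 8\lambda_*^{1/2} - 1$. Then, from (26) we have

(62)  $E_\theta(r_\alpha^2) = (b_1 + b_2) N^{1/2} n^{-1}$
      $\quad + \sum_{j=0}^{J-1}\sum_i E_\theta\big((S^2_{j,i} - L n^{-1})_+\, I(S^2_{j,i} \le \lambda_* L n^{-1})\big)$
      $\quad + b_3 L n^{-1} E_\theta(\mathrm{Card}\{(j,i) : S^2_{j,i} > \lambda_* L n^{-1}\}).$

The last term can be easily bounded using Lemma 5 as

$b_3 L n^{-1} E_\theta(\mathrm{Card}\{(j,i) : S^2_{j,i} > \lambda_* L n^{-1}\}) \le b_3 \min(N n^{-1}, D_\tau M^{2/(1+2\tau)} n^{-2\tau/(1+2\tau)}).$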




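Set $D = \sum_{j=0}^{J-1}\sum_i E_\theta\big((S^2_{j,i} - L n^{-1})_+\, I(S^2_{j,i} \le \lambda_* L n^{-1})\big)$. Using nearly identical arguments given in the derivation of (55) and (56) in the proof of Theorem 4, $D$ is bounded as $D \le 4N^{1/2} n^{-1} + C_\tau M^{2/(1+2\tau)} n^{-2\tau/(1+2\tau)}$ for some constant $C_\tau > \lambda_*$. On the other hand, it is easy to see that $D \le \sum_{j=0}^{J-1}\sum_i (\lambda_* L n^{-1} - L n^{-1}) = (\lambda_* - 1) N n^{-1}$ and consequently $\sup_{\theta \in B^{\tau}_{p,q}(M)} E(r_\alpha^2) \le (b_1 + b_2 + 4) N^{1/2} n^{-1} + C_\tau \min(N n^{-1}, M^{2/(1+2\tau)} n^{-2\tau/(1+2\tau)})$.

We now turn to the coverage probability. Again, set $C(\theta) = P(\|\hat\theta - \theta\|_2^2 > r_\alpha^2)$. We want to show that $\sup_{\theta \in \mathbb{R}^N} C(\theta) \le \alpha + 2(\log n)^{-1}$. Note that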


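$\|\hat\theta - \theta\|_2^2 = \sum_{j=0}^{J-1}\sum_i \xi^2_{j,i}\, I(S^2_{j,i} \le \lambda_* L n^{-1}) + n^{-1} \sum_{j=0}^{J-1}\sum_i \chi^2_{j,i}\, I(S^2_{j,i} > \lambda_* L n^{-1}).$

Set $\mathcal{L}_3 = \{(j,i) : \xi^2_{j,i} > 4\lambda_* L n^{-1}\}$ and $\mathcal{L}_4 = \{(j,i) : \xi^2_{j,i} \le 4\lambda_* L n^{-1}\}$. It then follows from the definition of the radius $r_\alpha$ given in (26) that

$C(\theta) \le P\Big(\sum_{(j,i) \in \mathcal{L}_3} [\xi^2_{j,i} I(S^2_{j,i} \le \lambda_* L n^{-1}) + n^{-1}\chi^2_{j,i} I(S^2_{j,i} > \lambda_* L n^{-1})]$
$\qquad\qquad > \sum_{(j,i) \in \mathcal{L}_3} [(S^2_{j,i} - L n^{-1})_+ I(S^2_{j,i} \le \lambda_* L n^{-1}) + b_3 L n^{-1} I(S^2_{j,i} > \lambda_* L n^{-1})]\Big)$
$\quad + P\Big(\sum_{(j,i) \in \mathcal{L}_4} [\xi^2_{j,i} I(S^2_{j,i} \le \lambda_* L n^{-1}) + n^{-1}\chi^2_{j,i} I(S^2_{j,i} > \lambda_* L n^{-1})]$
$\qquad\qquad > (b_1 + b_2) N^{1/2} n^{-1} + \sum_{(j,i) \in \mathcal{L}_4} [(S^2_{j,i} - L n^{-1})_+ I(S^2_{j,i} \le \lambda_* L n^{-1}) + b_3 L n^{-1} I(S^2_{j,i} > \lambda_* L n^{-1})]\Big)$
$= T_1 + T_2.$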
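We first bound the term $T_1$. Similarly as in the proof of Theorem 4,

(63)  $T_1 \le \sum_{(j,i) \in \mathcal{L}_3} \big(P(S^2_{j,i} \le \lambda_* L n^{-1}) + P(\chi^2_{j,i} > b_3 L)\big) \le n^{-2}\, \mathrm{Card}(\mathcal{L}_3) \le L^{-1}.$

On the other hand, note that

$T_2 \le P\Big(\sum_{(j,i) \in \mathcal{L}_4} [(S^2_{j,i} - \xi^2_{j,i} - L n^{-1}) + b_3 L n^{-1} I(S^2_{j,i} > \lambda_* L n^{-1})]$
$\qquad\qquad < -(b_1 + b_2) N^{1/2} n^{-1} + \sum_{(j,i) \in \mathcal{L}_4} (S^2_{j,i} + n^{-1}\chi^2_{j,i} - \xi^2_{j,i} - L n^{-1})\, I(S^2_{j,i} > \lambda_* L n^{-1})\Big)$
$\quad \le P\Big(\sum_{(j,i) \in \mathcal{L}_4} (S^2_{j,i} - \xi^2_{j,i} - L n^{-1}) < -(b_1 + b_2) N^{1/2} n^{-1}\Big)$
$\qquad + P\Big(\sum_{(j,i) \in \mathcal{L}_4} (S^2_{j,i} + n^{-1}\chi^2_{j,i} - \xi^2_{j,i} - L n^{-1})\, I(S^2_{j,i} > \lambda_* L n^{-1}) > \sum_{(j,i) \in \mathcal{L}_4} b_3 L n^{-1} I(S^2_{j,i} > \lambda_* L n^{-1})\Big)$
$\quad = T_{21} + T_{22}.$

Set $\tilde Z_{j,i} = \xi_{j,i}^{-1} \sum_{(j,k) \in B^j_i} \theta_{j,k} z_{j,k}$. Then $\tilde Z_{j,i}$ is a standard Normal random variable and

$T_{21} = P\Big(\sum_{(j,i) \in \mathcal{L}_4} (2n^{-1/2}\xi_{j,i}\tilde Z_{j,i} + n^{-1}\chi^2_{j,i} - L n^{-1}) < -(b_1 + b_2) N^{1/2} n^{-1}\Big)$
$\quad \le P\Big(\sum_{(j,i) \in \mathcal{L}_4} \chi^2_{j,i} < -b_1 N^{1/2} + \mathrm{Card}(\mathcal{L}_4) L\Big) + P\Big(\sum_{(j,i) \in \mathcal{L}_4} \xi_{j,i}\tilde Z_{j,i} < -\tfrac{1}{2} b_2 N^{1/2} n^{-1/2}\Big).$

If $\mathrm{Card}(\mathcal{L}_4) L \le b_1 N^{1/2}$, then $P(\sum_{(j,i) \in \mathcal{L}_4} \chi^2_{j,i} < -b_1 N^{1/2} + \mathrm{Card}(\mathcal{L}_4) L) = 0$. When $\mathrm{Card}(\mathcal{L}_4) L > b_1 N^{1/2}$, equation (45) with $m = \mathrm{Card}(\mathcal{L}_4) L \le N$ and $d = b_1 N^{1/2}/m$ yields that

$P\Big(\sum_{(j,i) \in \mathcal{L}_4} \chi^2_{j,i} < -b_1 N^{1/2} + \mathrm{Card}(\mathcal{L}_4) L\Big) \le e^{-(1/4)d^2 m} \le e^{-(1/4)b_1^2} = \frac{\alpha}{2}.$

On the other hand, note that $\sum_{(j,i) \in \mathcal{L}_4} \xi^2_{j,i} \le N L^{-1} \cdot 4\lambda_* L n^{-1} = 4\lambda_* N n^{-1}$ and hence $P(\sum_{(j,i) \in \mathcal{L}_4} \xi_{j,i}\tilde Z_{j,i} < -\frac{1}{2} b_2 N^{1/2} n^{-1/2}) \le P(Z < -\frac{1}{4} b_2 \lambda_*^{-1/2}) \le \frac{\alpha}{4}$ where $Z \sim N(0,1)$.

We now turn to the term $T_{22}$. Note that $\xi^2_{j,i} \le 4\lambda_* L n^{-1}$ for $(j,i) \in \mathcal{L}_4$. Hence

$T_{22} \le \sum_{(j,i) \in \mathcal{L}_4} P(2n^{-1/2}\xi_{j,i}\tilde Z_{j,i} + 2n^{-1}\chi^2_{j,i} > (b_3 + 1) L n^{-1})$
$\quad \le \sum_{(j,i) \in \mathcal{L}_4} P\big(\tilde Z_{j,i} > \tfrac{1}{4}\lambda_*^{-1/2}(b_3 - 2\lambda_* + 1) L^{1/2}\big) + \sum_{(j,i) \in \mathcal{L}_4} P(\chi^2_{j,i} > \lambda_* L)$
$\quad \le \sum_{(j,i) \in \mathcal{L}_4} P(\tilde Z_{j,i} > 2(\log n)^{1/2}) + \sum_{(j,i) \in \mathcal{L}_4} \tfrac{1}{2} n^{-2} \le L^{-1} N n^{-2} \le L^{-1}.$

Hence, $C(\theta) \le T_1 + T_{21} + T_{22} \le \alpha + 2L^{-1} = \alpha + 2(\log n)^{-1}$. □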


REFERENCES 


[1] BARAUD, Y. (2004). Confidence balls in Gaussian regression. Ann. Statist. 32 528-551. MR2060168
[2] BERAN, R. (1996). Confidence sets centered at $C_p$-estimators. Ann. Inst. Statist. Math. 48 1-15. MR1392512
[3] BERAN, R. and DÜMBGEN, L. (1998). Modulation of estimators and confidence sets. Ann. Statist. 26 1826-1856. MR1673280
[4] CAI, T. (1999). Adaptive wavelet estimation: A block thresholding and oracle inequality approach. Ann. Statist. 27 898-924. MR1724035
[5] CAI, T. (2002). On block thresholding in wavelet regression: Adaptivity, block size, and threshold level. Statist. Sinica 12 1241-1273. MR1947074
[6] CAI, T. and LOW, M. (2004). Adaptive confidence balls. Technical report, Dept. Statistics, Univ. Pennsylvania.
[7] DONOHO, D. L. and JOHNSTONE, I. M. (1998). Minimax estimation via wavelet shrinkage. Ann. Statist. 26 879-921. MR1635414
[8] GENOVESE, C. R. and WASSERMAN, L. (2005). Confidence sets for nonparametric wavelet regression. Ann. Statist. 33 698-729. MR2163157
[9] HOFFMANN, M. and LEPSKI, O. (2002). Random rates in anisotropic regression (with discussion). Ann. Statist. 30 325-396. MR1902892
[10] JUDITSKY, A. and LAMBERT-LACROIX, S. (2003). Nonparametric confidence set estimation. Math. Methods Statist. 12 410-428. MR2054156
[11] LI, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. Statist. 17 1001-1008. MR1015135
[12] ROBINS, J. and VAN DER VAART, A. (2006). Adaptive nonparametric confidence sets. Ann. Statist. 34 229-253.


DEPARTMENT OF STATISTICS 

THE WHARTON SCHOOL 

UNIVERSITY OF PENNSYLVANIA 

PHILADELPHIA, PENNSYLVANIA 19104-6340 

USA 

E-MAIL: tcai@wharton.upenn.edu
        lowm@wharton.upenn.edu


The Annals of Statistics 

2006, Vol. 34, No. 1, 229-253

DOI: 10.1214/009053605000000877

© Institute of Mathematical Statistics, 2006 


ADAPTIVE NONPARAMETRIC CONFIDENCE SETS 


BY JAMES ROBINS AND AAD VAN DER VAART 
Harvard University and Vrije Universiteit Amsterdam 


We construct honest confidence regions for a Hilbert space-valued parameter in various statistical models. The confidence sets can be centered at arbitrary adaptive estimators, and have diameter which adapts optimally to a given selection of models. The latter adaptation is necessarily limited in scope. We review the notion of adaptive confidence regions, and relate the optimal rates of the diameter of adaptive confidence regions to the minimax rates for testing and estimation. Applications include the finite normal mean model, the white noise model, density estimation and regression with random design.


1. Introduction. Consider an observation $X$ distributed according to a law $P_\theta$ depending on a parameter $\theta$ that ranges over a subset $\Theta$ of a separable Hilbert space. Specifically, we take the Hilbert space equal to $\mathbb{R}^n$ with the Euclidean norm, or the sequence space $\ell_2 = \{\theta = (\theta_1, \theta_2, \ldots) : \sum_{i=1}^{\infty} \theta_i^2 < \infty\}$ with the squared norm $\|\theta\|^2 = \sum_{i=1}^{\infty} \theta_i^2$. Our aim is to construct (asymptotic) confidence sets $\hat C_n$ of small diameter for the parameter $\theta$, which are "honest" in the sense that, for a given confidence level $1 - \alpha$,

(1.1)  $\liminf_{n \to \infty} \inf_{\theta \in \Theta} P_\theta(\theta \in \hat C_n) \ge 1 - \alpha.$

This problem has been considered by, among others, Li [32] and Baraud [1] in the case that $\Theta$ is equal to $\mathbb{R}^n$ and the observation is a Gaussian vector with mean $\theta$ and covariance matrix the identity, by Hoffmann and Lepski [20] in the case that $\theta \in \ell_2$ and the observation is an infinite sequence of Gaussian variables with means $\theta_i$ and variance $\sigma^2/n$, and by Beran [4], Beran and Dümbgen [5] and Genovese and Wasserman [18] in the case of the fixed design regression model. Our aim in this paper is to propose new confidence procedures for these and related models, which shed light on some of the questions raised in the discussion of the paper by Hoffmann and Lepski [20]. We construct confidence sets with the properties:


(i) The confidence set is honest on the model $\Theta$.
(ii) The confidence set is centered at an estimator of choice, for example, an 
adaptive estimator. 


Received June 2004; revised May 2005. 

AMS 2000 subject classifications. 62G15, 62G20, 62F25. 

Key words and phrases. Adaptation, white noise model, density estimation, regression, testing 
rate. 




230 J. ROBINS AND A. VAN DER VAART 


(iii) The diameter of the confidence set adapts to submodels of $\Theta$ in a rate-optimal way.


In the second and third points we improve on the results in the mentioned papers, at least as regards rates. Our method in its simplest form as presented below leads to an increase of the "constants."

Since completing our paper we have learned about the work of Juditsky and 
Lambert-Lacroix [25] and Cai and Low [11]. Juditsky and Lambert-Lacroix [25] 
appear to deserve priority in discussing adaptive confidence sets. In their beautiful 
paper they pose the problem within the setting of fixed-design regression with 
Gaussian errors and obtain adaptation in the scale of Besov spaces, using wavelet- 
based methods. An insightful discussion of the problem and basic insights about its 
relationship to loss estimation and minimax estimation and testing can already be 
found in this paper. Cai and Low [11] consider the problem of adaptive confidence 
regions in the setting of the Gaussian white noise model, and obtain adaptation 
in the scale of Besov spaces, also using wavelet-based estimators. Our method is 
more flexible and applies to more settings, but we develop the results only for the 
scale of Sobolev spaces. In certain respects it is close to the method of Juditsky 
and Lambert-Lacroix [25]. 

As is pointed out in the preceding references, the desired honesty (i) severely limits the possibility of adaptation as in aim (iii). In the past years many successes have been obtained in the construction of estimators that are simultaneously minimax over a large selection of models. (See, e.g., [3, 2, 13-16, 19, 29-31, 34, 37].) These estimators are able to adapt to the "regularity" of the true underlying parameter, without pre-knowledge of the parameter or its regularity. However, as pointed out by Birgé [7], these estimators have the property of being close to the true parameter without the statistician being able to tell how close it is. An adaptive estimator can adapt to an underlying model, but does not reveal which model it adapts to, with the consequence that nonparametric confidence sets are necessarily much wider than the actual discrepancy between an adaptive estimator and the true parameter.

If one drops "honesty" (i) from the requirements of the confidence set, but requires, for instance, only that the confidence set is honest over every submodel $\Theta_1 \subset \Theta$ of interest [i.e., (1.1) with $\Theta$ replaced by $\Theta_1$], then this embarrassing problem disappears, and it is possible to construct "confidence sets" of a diameter that adapts to the estimation rate. Most procedures in the literature fall in this category. However, dropping full honesty (i) appears to contradict the very definition of a confidence set. In this paper we require honesty in the sense of (1.1) with $\Theta$ the collection of all parameters deemed possible. Thus we consider a list of models and require honesty on the "biggest model" $\Theta$ in the list.

Under this requirement the possibilities for adaptation are severely limited. For a given submodel $\Theta_1 \subset \Theta$, the diameter of a confidence region that is honest for $\Theta$ cannot be of smaller order, uniformly over $\Theta_1$, than:


ADAPTIVE CONFIDENCE INTERVALS 231 


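(a) The "slowest rate" $\varepsilon_n \to 0$ such that for any estimator sequence $T_n$ and some $\beta > \alpha$

$\liminf_{n \to \infty} \sup_{\theta \in \Theta_1} P_\theta(\|T_n - \theta\| \ge \varepsilon_n) \ge \beta.$

This is typically the minimax rate of estimation for the model $\Theta_1$.

(b) The minimax rate of testing of the hypothesis $H_0 : \theta \in \Theta_0$ versus the alternative $H_1 : \theta \in \Theta$, $\|\theta - \Theta_0\| \ge \varepsilon_n$, for any given $\Theta_0 \subset \Theta_1$, for example, a one-point set $\Theta_0 = \{\theta_0\}$. This rate is often determined by the full model $\Theta$, rather than the submodel $\Theta_1$.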

These lower bounds appear to be well known. Juditsky and Lambert-Lacroix [25] 
discuss such bounds in the setting of Besov spaces. For completeness we give 
precise statements in Section 6. 

Our confidence sets have diameter of the order the maximum of the rates in (a)-(b), simultaneously over many submodels, at least for regularity classes as in the following example, and hence satisfy aim (iii).


EXAMPLE 1.1 (Regular parameters). A parameter $\theta \in \ell_2$ can be called $\beta$-regular (for a given $\beta > 0$) if it belongs, for some $L > 0$, to the ellipsoid

$S(\beta, L) = \Big\{\theta \in \ell_2 : \sum_{i=1}^{\infty} \theta_i^2 i^{2\beta} \le L\Big\}.$

If the coordinates of $\theta$ correspond to classical Fourier coefficients, then $S(\beta, L)$ corresponds to periodic functions with $\beta$ derivatives bounded by a multiple of $L$ in $L_2[0,1]$. (For real functions and the sine-cosine basis the correspondence is more accurate if we replace $i^{2\beta}$ by $(i-1)^{2\beta}$ for odd values of $i$. See, e.g., the Appendix of [38].)

Consider inference on $\theta \in S(\beta, L)$ based on observing each $\theta_i$ with an independent $N(0, \sigma^2/n)$ error, or in one of the other models discussed below, which yield similar results. The minimax estimation rate for $S(\beta, L)$ is $n^{-\beta/(2\beta+1)}$ (cf., e.g., [9, 21, 22, 36]). For $\beta_1 \ge \beta$ and $L_1 \le L$ we have $S(\beta_1, L_1) \subset S(\beta, L)$ and the minimax testing rate of $S(\beta_1, L_1)$ relative to $S(\beta, L)$ in the sense of (b) is $n^{-\beta/(2\beta+1/2)} \ll n^{-\beta/(2\beta+1)}$. (See [23], Theorem 2.1 or 3.1, or [24].)

If the supermodel $\Theta$ is equal to $S(\beta, L)$, then these bounds suggest that the diameter of a confidence set can be of order $n^{-\beta/(2\beta+1)}$ uniformly over $\Theta$, and of order $n^{-\beta_1/(2\beta_1+1)} \vee n^{-\beta/(2\beta+1/2)}$ uniformly over the smaller model $\Theta_1 = S(\beta_1, L_1)$ for $\beta_1 > \beta$ and $L_1 < L$. If $\beta_1 \in (\beta, 2\beta)$, then the latter rate is equal to $n^{-\beta_1/(2\beta_1+1)}$ and depends on the submodel. In that case we may say that adaptation occurs.

This type of adaptation is very different from adaptation in the context of estimation. For $\beta_1 \ge 2\beta$ the diameter is $n^{-\beta/(2\beta+1/2)}$ independent of the exact value of $\beta_1$, so that further regularity does not yield smaller confidence regions. Even on



TABLE 1
Order of maximal diameter of confidence regions on the submodel $S(\beta_1, L_1) \subset S(\beta, L)$, cut-off points and number of observations needed to estimate $\sigma^2$

β      β_1     Radius on S(β_1, L_1)   Cut-off     Obs for σ²

1      ≥2      n^{-2/5}                n^{2/5}     ≫ n^{2/5}
1      3/2     n^{-3/8}                n^{2/5}     ≫ n^{2/5}
1      1       n^{-1/3}                n^{2/5}     ≫ n^{2/5}
1/2    ≥1      n^{-1/3}                n^{2/3}     ≫ n^{2/3}
1/2    3/4     n^{-3/10}               n^{2/3}     ≫ n^{2/3}
1/2    1/2     n^{-1/4}                n^{2/3}     ≫ n^{2/3}
1/4    ≥1/2    n^{-1/4}                n           ≫ n
1/4    1/4     n^{-1/6}                n           ≫ n
1/8    ≥1/4    n^{-1/6}                n^{4/3}     ≫ n^{4/3}
0      ≥0      1                       n^2         ≫ n^2


very small submodels ($\beta_1 \to \infty$), the diameter of a confidence region is at least of the order $n^{-\beta/(2\beta+1/2)}$ determined by the supermodel. As illustration, Table 1 gives the rates for some values of the regularity parameters. The meaning of the last two columns of the table is explained later on in the paper.
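The rates of this example are simple to tabulate. The sketch below computes, with exact fractions, the exponent of the radius $n^{-\beta_1/(2\beta_1+1)} \vee n^{-\beta/(2\beta+1/2)}$ on the submodel. The function names are hypothetical, and the formula in `cutoff_exponent` is only an observation about the values in the Cut-off column of Table 1, not a definition given in this section.

```python
from fractions import Fraction


def radius_exponent(beta, beta1):
    """Exponent e such that the radius on S(beta1, L1) is of order n**(-e)."""
    beta, beta1 = Fraction(beta), Fraction(beta1)
    estimation = beta1 / (2 * beta1 + 1)           # estimation rate on the submodel
    testing = beta / (2 * beta + Fraction(1, 2))   # testing rate against the supermodel
    # The maximum of two rates n**(-e) has the smaller exponent.
    return min(estimation, testing)


def cutoff_exponent(beta):
    """Exponent matching the Cut-off column, n**(1/(2*beta + 1/2)) (an observed pattern)."""
    return 1 / (2 * Fraction(beta) + Fraction(1, 2))
```

For instance, `radius_exponent(1, Fraction(3, 2))` gives `Fraction(3, 8)`, the second row of the table, and any `beta1` at or above `2 * beta` returns the same value as `beta1 = 2 * beta`.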


Our method to construct confidence regions, described in Section 2, is based on a sample-splitting procedure. We use half the data to construct centering estimators $\hat\theta^{(n)}$, and an independent second half to construct a confidence region around $\hat\theta^{(n)}$. The nature of the initial estimator $\hat\theta^{(n)}$ is irrelevant for the honesty of the confidence procedure, and hence $\hat\theta^{(n)}$ can be any of our favorite estimators. In particular, it can be an estimator that adapts to a selection of models of our choice. Our procedure borrows its adaptive strength from these initial estimators, but of course only up to the limitations described earlier.

Refinements of this procedure would be to construct two confidence sets, with 
the roles of the two half-samples interchanged, and to take the intersection, or to 
split the sample into more parts. For restricted supermodels © the splitting may be 
avoided altogether. This may lead to better constants in the centering and diameter 
of the confidence set. In this paper we are interested in rates only, and for this our 
simple sample-splitting scheme suffices. 

In the case that the observations are a random sample, we can form the two halves of the data by simply splitting the sample into two parts, using the first half-sample to construct the estimator $\hat\theta^{(n)}$ and the second to construct the confidence region. In other examples of interest a similar situation can be created using a more involved splitting device, which we describe below.

The organization of the paper is as follows. In Section 2 we describe the con- 
struction in a general framework. In Sections 3, 4 and 5 we give the details for the 


ADAPTIVE CONFIDENCE INTERVALS 233 


three main examples, sequence models, density estimation and random regression. 
Finally in Section 6 we relate the diameter of a confidence region to the testing and 
estimation rates. 

We close this introduction with a description of a number of examples to which 
our construction applies, together with a review of the literature. 


EXAMPLE 1.2 (Finite sequence model). In this model the observation is a
vector $X^{(n)} = (X_1, X_2, \ldots, X_n)$ from an n-dimensional normal distribution with
mean vector $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ and covariance matrix $(\sigma^2/n)I$. The variance $\sigma^2$
is known and the parameter $\theta$ is known to belong to a subset $\Theta$ of $\mathbb{R}^n$, which may
be all of $\mathbb{R}^n$.

This model was studied in [32] and [1] under the assumption that $\Theta = \mathbb{R}^n$. The
naive procedure in this situation is the chi-square region
$\{\theta \in \mathbb{R}^n: \|\theta - X^{(n)}\|^2 \le (\sigma^2/n)\chi^2_{n,1-\alpha}\}$,
which derives from inverting the likelihood ratio test. It has diameter
of order 1, uniformly in (and independently of) $\theta$.

Li [32] showed that requiring honesty relative to all parameters $\theta \in \mathbb{R}^n$
implies that no confidence region can achieve a diameter that is uniformly smaller
than $n^{-1/4}$, and exhibited confidence regions around shrinkage estimators that may
achieve the rate $n^{-1/4}$ on the submodel where the shrinkage estimator performs
well. Li's confidence sets improve on the naive chi-square procedure at true
parameters where the shrinkage estimator improves upon the naive estimator $X^{(n)}$.
Baraud [1] constructs confidence regions that improve on the naive procedure in a
wider range of submodels. His procedure is based on comparing a range of
submodels by chi-square tests. The confidence regions in the present paper manage to
adapt to still more submodels, if the initial estimators $\hat\theta^{(n)}$ are chosen so as to fully
profit from the recent insights in adaptive estimation, such as in [8].

It is notable that in this model the variance $\sigma^2$ is assumed known. Baraud [1]
shows that in the case that $\sigma^2$ is an unknown parameter ranging over some interval
(even a very short one), confidence regions that are honest over $\Theta = \mathbb{R}^n$ and $\sigma^2$
can never have diameter less than order 1.

Because the observations in this example are non-i.i.d., splitting the sample
is not a good device for separating the construction of a center and a radius of
the confidence region. However, we may artificially produce two normal vectors
$X'$ and $X''$ with means $\theta$ from a given $N_n(\theta, (\sigma^2/n)I)$-distributed random
vector $X$ using randomization. Given a sample of independent, uniform variables $U_i$
independent of $X$, it suffices to define

    $X_i' = X_i + \Phi^{-1}(U_i)\,\sigma/\sqrt{n},$
    $X_i'' = X_i - \Phi^{-1}(U_i)\,\sigma/\sqrt{n}.$

Then it can be verified that $X_i'$ and $X_i''$ are independent random variables with
means $\theta_i$ and variances $2\sigma^2/n$. Thus the observations can be duplicated at the cost
of multiplying the variance $\sigma^2$ by 2. In the remainder of the paper we shall assume


234 J. ROBINS AND A. VAN DER VAART 


that a device of this type has been applied, and write $X^{(n)}$ for the second sample
(on which the estimate of the radius of the confidence set is based), and assume
that this is independent of the initial estimator $\hat\theta^{(n)}$ for $\theta$.
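
The randomization device can be tried out numerically. The following is a minimal sketch, not part of the original construction: the function name `split_gaussian_vector` and its arguments are hypothetical, and only the Python standard library is assumed. It draws the uniform variables, applies the standard normal quantile function, and returns the two artificial samples.

```python
import random
from statistics import NormalDist


def split_gaussian_vector(x, sigma, n, rng):
    """Return two vectors built from x by adding and subtracting the same
    independent N(0, sigma^2/n) noise, coordinate by coordinate.

    x     : observed coordinates X_i, assumed N(theta_i, sigma^2/n)
    sigma : known standard deviation parameter
    n     : noise-level index (variance of each X_i is sigma^2/n)
    rng   : a random.Random instance supplying the uniform variables U_i
    """
    std_normal = NormalDist()          # standard normal, gives Phi^{-1}
    scale = sigma / n ** 0.5
    x_first, x_second = [], []
    for xi in x:
        u = rng.random()               # uniform on [0, 1)
        while u == 0.0:                # inv_cdf requires 0 < u < 1
            u = rng.random()
        z = std_normal.inv_cdf(u)      # Phi^{-1}(U_i)
        x_first.append(xi + z * scale)
        x_second.append(xi - z * scale)
    return x_first, x_second
```

Each output coordinate has variance $2\sigma^2/n$, and the average of the two outputs returns the original observation, so no information is lost by the split.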

Knowledge of $\sigma^2$ is crucial for this randomization step. Good estimators would
do as well, but it is impossible to estimate $\sigma^2$ in this model without restricting the
mean parameter $\theta$ to a proper subset $\Theta$ of $\mathbb{R}^n$. Baraud [1] shows that the size of a
confidence set can never be of smaller order than the imprecision in $\sigma$.


EXAMPLE 1.3 (Infinite sequence model). In this model the observations are
an infinite sequence $X^{(n)} = (X_1, X_2, \ldots)$ of independent random variables $X_i$
possessing normal distributions with means $EX_i = \theta_i$ and variance $\sigma^2/n$. The
parameter is the mean vector $\theta = (\theta_1, \theta_2, \ldots)$ and is known to belong to a subset $\Theta$
of $\ell_2$.

This model is a version of the white noise model, and is considered in connection
with confidence regions in Hoffmann and Lepski [20]. (The focus of these
authors is on "random normalizing constants" rather than confidence regions, but, like
most of the discussants of their paper, we interpret their results with respect to their
implications for confidence regions.) Hoffmann and Lepski [20] assume that there
is a largest model $\Theta$ of interest, and exhibit confidence regions that are adaptive
to finitely many submodels. Our construction allows infinitely many submodels
and yields confidence regions around arbitrary initial estimators $\hat\theta^{(n)}$, for example,
adaptive ones. Hoffmann and Lepski consider the general setting of anisotropic
regression models, but we illustrate our method for the regularity classes of
Example 1.1 only.

We can use the same device as in Example 1.2 to duplicate the observations, at
the cost of doubling the variance $\sigma^2$.

Typically one chooses $\Theta$ to be a relatively small subset of $\ell_2$. Then it is easy
to find good estimators of $\sigma^2$, and it is not necessary to assume that $\sigma^2$ is a priori
known. For instance, if $\Theta$ is an ellipsoid of the form
$\{\theta \in \ell_2: \sum_{i=1}^\infty \theta_i^2 i^{2\beta} < \infty\}$,
then we may base an estimate of $\sigma^2$ on the observations $X_{k+1}, X_{k+2}, \ldots, X_{k+m}$
for sufficiently large integers k, m, which are approximately $N(0, \sigma^2)$-distributed
for large k. The availability of an infinite sequence allows one to control the bias
and variance of estimators of $\sigma^2$ to arbitrary precision by choosing k and m,
respectively, sufficiently large.


EXAMPLE 1.4 (Density estimation). In this model the observation is an i.i.d.
sample $X_1, \ldots, X_n$ from a density $f$ relative to some measure $\mu$ on a measurable
space $(\mathcal{X}, \mathcal{A})$. The density $f$ is known to belong to a subset $\mathcal{F}$ of $L_2(\mathcal{X}, \mathcal{A}, \mu)$.

We can cast this example into a problem of estimating a sequence
$\theta = (\theta_1, \theta_2, \ldots)$ of parameters by expanding $f$ on a fixed orthonormal basis $e_1, e_2, \ldots$
of $L_2(\mathcal{X}, \mathcal{A}, \mu)$. This expansion takes the form of the Fourier series $f = \sum_{i=1}^\infty \theta_i e_i$,
for the Fourier coefficients $\theta_i = \langle f, e_i \rangle = E e_i(X_1)$.




The empirical Fourier coefficients $Y_i = n^{-1} \sum_{j=1}^n e_i(X_j)$ are unbiased
estimators of the parameters $\theta_i$. However, they are only approximately normally
distributed and not independent, and it does not seem fruitful to cast this example into
the framework of the sequence model of Example 1.3 with observational vector
$(Y_1, Y_2, \ldots)$. The Le Cam equivalence of the white noise model and the density
estimation model, proved under conditions by Nussbaum [35], offers a different
connection between the two examples, but can be used only if $\mathcal{F}$ is restricted and
yields regions of complicated form. (The latter objection is alleviated by the recent
constructions of Brown, Carter, Low and Zhang [10].) Our direct approach gives
concrete confidence sets and in wider generality.

We can split the sample into two independent halves to construct the center $\hat\theta^{(n)}$
and the radius $R_n(\hat\theta^{(n)})$ of the confidence set.

There is no parameter $\sigma^2$ to be dealt with in this example.


EXAMPLE 1.5 (Random regression). In this model the observation is an i.i.d.
sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ from the distribution of a vector $(X, Y)$ described
structurally as $Y = f(X) + \varepsilon$, for $(X, \varepsilon)$ a random vector with $E(\varepsilon | X) = 0$ and
$E(\varepsilon^2 | X) < \infty$ almost surely. The regression function $f$ is known to belong to a
subset $\mathcal{F}$ of $L_2(\mathcal{X}, \mathcal{A}, P_X)$ for $P_X$ the marginal distribution of $X$, which is
assumed known. The variance function $\sigma^2(x) = E(\varepsilon^2 | X = x)$ need not be known,
although for confidence intervals that are honest in $\sigma^2$ we need a known upper
bound. We do not assume that the errors are normally distributed, and we do not
assume that $X$ and $\varepsilon$ are independent.

As in Example 1.4 we can cast this example into a problem of estimating a
sequence $\theta = (\theta_1, \theta_2, \ldots)$ of parameters by expanding $f$ on a fixed orthonormal
basis $e_1, e_2, \ldots$ of $L_2(\mathcal{X}, \mathcal{A}, P_X)$. The Fourier coefficients take the form
$\theta_i = \langle f, e_i \rangle = E e_i(X) Y$.

The Fourier coefficients can be estimated unbiasedly by the estimators
$Z_i = n^{-1} \sum_{j=1}^n Y_j e_i(X_j)$, but, as in Example 1.4, it does not appear useful to try to reduce
the model to the sequence model of Example 1.3 by considering $(Z_1, Z_2, \ldots)$ as
the observation.

The assumption that the design distribution $P_X$ is known may be realistic in
some practical situations, but is unpleasant. Perhaps it is a little surprising that
it is not a merely technical assumption, but essential for the construction of our
confidence sets. We intend to show elsewhere that the radius of the confidence
sets will increase if $P_X$ is unknown, in varying amount, depending on what a priori
assumptions are made on $P_X$. If $P_X$ is completely unknown, then intuitively
this model should be equivalent to the fixed design regression model discussed in
Example 1.6.


EXAMPLE 1.6 (Fixed regression). In this model the observation is a vector
$Y = (Y_1, \ldots, Y_n)$ of independent random variables distributed according to the
regression model $Y_i = f(x_i) + \varepsilon_i$, for $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. normal variables with $E\varepsilon_i = 0$




and $E\varepsilon_i^2 = \sigma^2$ and $x_1, \ldots, x_n$ known constants. The variance $\sigma^2$ is known and the
function $f$ is known to belong to a subset $\mathcal{F}$ of $L_2(\mathcal{X}, \mathcal{A}, \mu)$ for some
distribution $\mu$.

Genovese and Wasserman [18] put this model in a sequence framework by ex- 
pansion of the regression function on an empirical wavelet basis. They justify 
Beran [4] and Beran and Dümbgen [5] REACT confidence sets in terms of an 
honest confidence set over -regular regression functions f, described in terms 
of a wavelet expansion. This is also the model treated by Juditsky and Lambert- 
Lacroix [25]. 

The model can be seen to reduce to a version of the finite sequence model of
Example 1.2. All information about the regression function $f$ outside the design set
$\{x_1, \ldots, x_n\}$ must stem from the model and not from the data. This point was made
previously in Li [32], who gives the regression model as motivation for studying
the finite sequence model. We shall not further discuss this model separately.


2. Construction of confidence regions. Our method is based on sample
splitting. We suppose that initial estimators $\hat\theta^{(n)}$ are given, and construct the confidence
region based on $\hat\theta^{(n)}$ and an additional independent observation $X^{(n)}$. It was
discussed previously how to split the data into independent "halves" that can be used
for constructing $\hat\theta^{(n)}$ and $X^{(n)}$. The nature of the initial estimator $\hat\theta^{(n)}$ is irrelevant
for the honesty of the confidence procedure, and hence $\hat\theta^{(n)}$ can be any of our
favorite estimators. In particular, it can be an estimator that adapts to a selection of
models of our choice.

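Our confidence regions are based on estimators $R_n(\hat\theta^{(n)}) = R_n(\hat\theta^{(n)}, X^{(n)})$ of
the squared norm $\|\theta - \hat\theta^{(n)}\|^2$ such that


(2.1)    $\liminf_{n\to\infty} \inf_{\theta\in\Theta} P_\theta\bigl(R_n(\hat\theta^{(n)}) - \|\theta - \hat\theta^{(n)}\|^2 \ge -z_\alpha \hat\tau_{n,\theta} \mid \hat\theta^{(n)}\bigr) \ge 1 - \alpha,$


for "scale estimators" $\hat\tau_{n,\theta}$ and "quantiles" $z_\alpha$. The probability is computed
conditionally given the estimators $\hat\theta^{(n)}$ and hence refers only to the observation $X^{(n)}$
used to calculate $R_n(\hat\theta^{(n)})$ and $\hat\tau_{n,\theta}$. In view of Fatou's lemma the unconditional
coverage probability will also be at least $1 - \alpha$. Then the set


(2.2)    $\hat C_n = \bigl\{\theta \in \Theta: \|\theta - \hat\theta^{(n)}\| \le \sqrt{z_\alpha \hat\tau_{n,\theta} + R_n(\hat\theta^{(n)})}\bigr\}$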


is an honest confidence region with coverage probability at least $1 - \alpha$. (Define
$\sqrt{x}$ to be 0 if $x < 0$.) The confidence region $\hat C_n$ is in general not a ball. However,
in all our examples the scale estimators $\hat\tau_{n,\theta}$ satisfy


    $\hat\tau_{n,\theta} \lesssim \hat\tau_n + \|\theta - \hat\theta^{(n)}\| / \sqrt{n},$


where $\lesssim$ denotes smaller than up to a constant which is fixed by the setting and $\hat\tau_n$
is independent of $\theta$ and determined by the size of the parameter set $\Theta$. It can be
seen from this that the diameter of the confidence region satisfies

(2.3)    $\mathrm{diam}(\hat C_n) \lesssim \sqrt{\hat\tau_n} + \sqrt{R_n(\hat\theta^{(n)})} + n^{-1/2}.$


(See the proof of the proposition below for a precise argument.) The last term on
the right is the parametric rate of estimation and is typically negligible relative to
the other terms. The first term $\sqrt{\hat\tau_n}$ depends on the supermodel $\Theta$ and its size is
typically the same on every submodel.

The possibility of adaptation hinges on the second term. Typically (2.1) extends
to a full, two-sided comparison, of the form
$|R_n(\hat\theta^{(n)}) - \|\theta - \hat\theta^{(n)}\|^2| = O_{P_\theta}(\hat\tau_{n,\theta})$
uniformly in $\theta \in \Theta$. Then it follows that the diameter of $\hat C_n$ is of the order,
uniformly in $\theta \in \Theta$,


    $\mathrm{diam}(\hat C_n) = O_{P_\theta}\bigl(\sqrt{\hat\tau_n} + \|\hat\theta^{(n)} - \theta\| + n^{-1/2}\bigr).$


The diameter of the confidence set on a given submodel $\Theta_1 \subset \Theta$ is bounded above
by the biggest order of the expression on the right-hand side under $\theta$, for $\theta$ ranging
over $\Theta_1$. For small submodels, or more generally submodels where the estimators
$\hat\theta^{(n)}$ perform well, the diameter will be dominated by the term $\sqrt{\hat\tau_n}$, the rate of
the estimators of $\|\theta - \hat\theta^{(n)}\|^2$. On the other hand, in bigger submodels the term
$\|\hat\theta^{(n)} - \theta\|$ may dominate. It is thus that we achieve adaptation to smaller models,
but only up to the order $\sqrt{\hat\tau_n}$.

It is apparent from the preceding description that our confidence regions depend
crucially on good estimators of the squared distance $\|\theta - \hat\theta^{(n)}\|^2$ of the parameter
$\theta$ to the point $\hat\theta^{(n)}$. The latter point $\hat\theta^{(n)}$ may be considered fixed, as we
condition on the initial estimator. The problem of constructing such estimators is
therefore closely connected to the problem of estimating the squared norm $\|\theta\|^2$ of a
Hilbert space-valued parameter. In some examples this is straightforward, but in
the situations of density estimation and regression this problem is more involved.
Fortunately, in the latter cases the estimation of a “quadratic functional” has been 
studied in detail by, among others, Fan [17], Bickel and Ritov [6], Laurent [26, 27] 
and Laurent and Massart [28], whose work obtains additional relevance in the 
present paper. The more recent papers consider adaptive estimators of the squared 
norm, but for our purposes optimal estimation under the biggest model will be 
sufficient. In view of their simplicity we shall adapt the constructions of Laurent 
[26, 27] to our purposes, but other approaches could be used as well. 

This method consists of estimating the squared norm $\|\Pi_k\theta - \Pi_k\hat\theta^{(n)}\|^2$ of the
projection of the difference $\theta - \hat\theta^{(n)}$ [where $\Pi_k\theta = (\theta_1, \ldots, \theta_k, 0, 0, \ldots)$]
unbiasedly and trading off the resulting (squared) bias versus the variance of the
estimator. Under the assumption that $\hat\theta^{(n)}$ takes its values in $\Theta$, the bias is bounded by a
multiple of


(2.4)    $B_k^2 := \sup_{\theta\in\Theta} \|\theta - \Pi_k\theta\|^2.$




The variance turns out to be of the order, for a parameter $\sigma^2$ that depends on the
setting,


(2.5)    $\hat\tau^2_{k,n,\theta} = \dfrac{2\sigma^4 k}{n^2} + \dfrac{4\sigma^2 \|\Pi_k\theta - \Pi_k\hat\theta^{(n)}\|^2}{n}.$


The root $\hat\tau_{k,n,\theta}$ of this variance and the bias $B_k^2$ must be incorporated into the
variable $\hat\tau_{n,\theta}$ as in (2.1). We define $\hat\tau_n = \sqrt{2}\,\sigma^2 \sqrt{k}/n + B_k^2$, and conclude, in view
of (2.3), that the diameter of the resulting confidence set (2.2) is of the order


(2.6)    $\dfrac{\sigma k^{1/4}}{\sqrt{n}} + B_k + \|\theta - \hat\theta^{(n)}\|.$


We can now choose an optimal value of k by trading off $k^{1/4}/\sqrt{n}$ versus $B_k$.

The parameter $\sigma$ may depend on the unknown $\theta$, but in that case must be
uniformly bounded over the supermodel $\Theta$.

For later reference we formalize the preceding as a proposition. Rather than
making assumptions on bias and variance, we assume that the estimation rate of
the estimators $R_{k,n}(\hat\theta^{(n)})$ is of the order as in the preceding discussion: for $\hat\tau_{k,n,\theta}$
as in (2.5), some number $z_\alpha$ and any sequences $k_n \to \infty$ and $M_n \to \infty$,


(2.7)    $\limsup_{n\to\infty} \sup_{\theta\in\Theta} P_\theta\bigl(R_{k_n,n}(\hat\theta^{(n)}) - \|\Pi_{k_n}\theta - \Pi_{k_n}\hat\theta^{(n)}\|^2 \le -z_\alpha \hat\tau_{k_n,n,\theta}\bigr) \le \alpha,$

(2.8)    $\lim_{n\to\infty} \sup_{\theta\in\Theta} P_\theta\bigl(\bigl|R_{k_n,n}(\hat\theta^{(n)}) - \|\Pi_{k_n}\theta - \Pi_{k_n}\hat\theta^{(n)}\|^2\bigr| \ge M_n \hat\tau_{k_n,n,\theta}\bigr) = 0.$


Of course, the second equation implies that the first is satisfied for sufficiently
large $z_\alpha$, whereas an "absolute version" of the first equation for all $\alpha \in (0, 1)$ will
imply the second one.


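PROPOSITION 2.1. Suppose that $R_{k,n}(\hat\theta^{(n)})$ are estimators that satisfy
(2.7)-(2.8) for $\hat\tau_{k,n,\theta}$ given in (2.5) and some $\alpha \in (0, 1)$. Assume that $\hat\theta^{(n)}$ takes its
values in $\Theta$. Then for $B_k$ given in (2.4) the sets


    $\hat C_n = \bigl\{\theta \in \Theta: \|\theta - \hat\theta^{(n)}\| \le \sqrt{z_\alpha \hat\tau_{k_n,n,\theta} + R_{k_n,n}(\hat\theta^{(n)})} + 2B_{k_n}\bigr\}$


are honest $(1 - \alpha)$-confidence sets for $\theta \in \Theta$, for any $k_n \to \infty$, with diameter
satisfying, for any $M_n \to \infty$,


    $\limsup_{n\to\infty} \sup_{\theta\in\Theta} P_\theta\Bigl(\mathrm{diam}(\hat C_n) \ge M_n\Bigl[\dfrac{\sigma k_n^{1/4}}{\sqrt{n}} + B_{k_n} + \|\theta - \hat\theta^{(n)}\|\Bigr]\Bigr) = 0.$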


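PROOF. By (2.4) the difference $\|\theta - \hat\theta^{(n)}\|$ is bounded above by
$\|\Pi_{k_n}(\theta - \hat\theta^{(n)})\| + 2B_{k_n}$. Therefore, by the definition of $\hat C_n$,


    $P_\theta(\theta \notin \hat C_n) \le P_\theta\Bigl(\|\Pi_{k_n}(\theta - \hat\theta^{(n)})\| > \sqrt{z_\alpha \hat\tau_{k_n,n,\theta} + R_{k_n,n}(\hat\theta^{(n)})}\Bigr).$


In view of (2.7) the right-hand side is asymptotically bounded above by $\alpha$,
uniformly in $\theta \in \Theta$. Hence $\hat C_n$ is an asymptotic confidence region of confidence level
$1 - \alpha$.

In view of the form (2.5) of $\hat\tau_{k,n,\theta}$ every element $\theta$ of $\hat C_n$ satisfies


    $\|\theta - \hat\theta^{(n)}\| \le \Bigl[z_\alpha \dfrac{\sqrt{2}\,\sigma^2\sqrt{k_n}}{n} + R_{k_n,n}(\hat\theta^{(n)})\Bigr]^{1/2} + 2B_{k_n} + \Bigl[z_\alpha \dfrac{2\sigma}{\sqrt{n}} \|\theta - \hat\theta^{(n)}\|\Bigr]^{1/2}.$


The inequality $x \le B + A\sqrt{x}$ for real numbers $x$ and positive real numbers
$A$ and $B$ implies that $x \le 2B + 2A^2$. We conclude that the diameter of $\hat C_n$ is
bounded by a multiple of


    $\sqrt{R_{k_n,n}(\hat\theta^{(n)})} + B_{k_n} + \dfrac{\sigma k_n^{1/4}}{\sqrt{n}} + \dfrac{\sigma}{\sqrt{n}}.$


The variables $R_{k_n,n}(\hat\theta^{(n)})$ are with $P_\theta$-probability tending to 1 bounded above by
a multiple of $\|\Pi_{k_n}\theta - \Pi_{k_n}\hat\theta^{(n)}\|^2 + M_n \hat\tau_{k_n,n,\theta}$, for any given $M_n \to \infty$, by (2.8).
Therefore, with probability tending to 1 the diameter of $\hat C_n$ is bounded by a multiple of


    $\dfrac{\sigma k_n^{1/4}}{\sqrt{n}} + \|\theta - \hat\theta^{(n)}\| + \sqrt{M_n \hat\tau_{k_n,n,\theta}} + B_{k_n} + \dfrac{\sigma}{\sqrt{n}}.$


Here the last term is negligible relative to the first. The proposition follows in
view of the form (2.5) of $\hat\tau_{k,n,\theta}$ and the inequality
$\sigma k^{1/4}/\sqrt{n} + \sqrt{\sigma x}/n^{1/4} + x \le 2\sigma k^{1/4}/\sqrt{n} + 2x$,
which is valid for any $k \ge 1$, $x \ge 0$ and $\sigma > 0$. []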


The natural (or "naive") estimators $R_n(\hat\theta^{(n)})$ of $\|\theta - \hat\theta^{(n)}\|^2$ in our examples
may assume negative values, which could lead to a confidence set $\hat C_n$ in (2.2) of zero
diameter. This is unlike the usual situation in parametric models, where $\sqrt{n}$ times
the radius of a confidence region for $\theta$ generally has the desirable property of
tending in probability to a positive constant. In practice it might be useful to eliminate
the possibility of radii of zero by substituting for the right-hand side of (2.2) a
more conservative cut-off, given by the maximum of the current right-hand side
and $\sqrt{z_\alpha \hat\tau_{n,\theta}}$ (or perhaps $\sqrt{z_\alpha \hat\tau_{n,\theta}/2}$).


EXAMPLE 2.1 (Model of dimension n). If $\Theta = \mathbb{R}^n$, then we can avoid a bias
by choosing $k = n$. Then the diameter of the confidence sets is of the order equal
to the maximum of $n^{-1/4}$ and the estimation error $\|\theta - \hat\theta^{(n)}\|$.


EXAMPLE 2.2 (Regular models). The usual models to define regular parameters
are the ellipsoids $S(\beta, L) = \{\theta \in \ell_2: \sum_{i=1}^\infty \theta_i^2 i^{2\beta} \le L^2\}$, for $\beta \ge 0$ and $L > 0$
given. Suppose we choose $\Theta = S(\beta, L)$ for fixed values of $\beta$ and $L$ as the
supermodel, on which we require honesty, and consider adaptation on ellipsoids defined
by different parameter values.




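If we cut off the series expansion at level k, then the maximal squared bias is
equal to


    $\sup_{\theta\in S(\beta,L)} \sum_{i>k} \theta_i^2 \le \sup_{\theta\in S(\beta,L)} \sum_{i>k} \theta_i^2 \Bigl(\dfrac{i}{k}\Bigr)^{2\beta} \le \dfrac{L^2}{k^{2\beta}}.$


This leads to the trade-off $k^{1/4}/\sqrt{n} \sim L/k^\beta$, resulting in a cut-off of the order

    $k \sim L^{1/(\beta+1/4)} n^{1/(2\beta+1/2)}$

and a bias of the order $n^{-\beta/(2\beta+1/2)} L^{1/(4\beta+1)}$.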

This choice of k is compatible with $k \le n$ only if $\beta \ge 1/4$. Thus if $\theta$ is restricted
to $\mathbb{R}^n$, as in the finite sequence model, then we consider submodels $S(\beta, L)$ with
$\beta \ge 1/4$ only.

For this choice of k we obtain a confidence region for the full parameter $\theta \in \ell_2$
of diameter of order equal to the maximum of $n^{-\beta/(2\beta+1/2)}$ and the estimation
error $\|\theta - \hat\theta^{(n)}\|$. The lower bound $n^{-\beta/(2\beta+1/2)}$ and the cut-off k are for some
values of $\beta$ given in the third and fourth columns of Table 1.

Thus the role of the minimal diameter $n^{-1/4}$ in the preceding example is now
taken over by $n^{-\beta/(2\beta+1/2)}$.

For the initial estimators $\hat\theta^{(n)}$ there is a variety of choices. A relatively simple
scheme is to choose $\hat\theta^{(n)}$ to adapt to all regularity classes $S(\gamma, M)$ in the sense that,
for all $\gamma \ge \beta$ and all $M > 0$, for some constants $C_{\gamma,M}$,


    $\sup_{\theta\in S(\gamma,M)} E_\theta \|\hat\theta^{(n)} - \theta\|^2 \le C_{\gamma,M}\, n^{-2\gamma/(2\gamma+1)}.$

Such estimators exist in the examples considered in the Introduction. In fact, there
exist estimators that adapt to a much larger collection of submodels than only the
Sobolev models considered in this paper. Combined with our construction this
will lead to a confidence region around $\hat\theta^{(n)}$ of diameter of the order $n^{-\gamma/(2\gamma+1)}$
uniformly over $S(\gamma, M)$ if $\gamma \in [\beta, 2\beta]$, and of the order $n^{-\beta/(2\beta+1/2)}$ over $S(\gamma, M)$
for other indices $\gamma$.
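
The exponents in this example can be tabulated mechanically. The sketch below is illustrative only: the function names are hypothetical, constants such as $L$ and $\sigma$ are ignored, and exact rational arithmetic from the Python standard library is assumed. It returns the exponent of the cut-off $k \sim n^{1/(2\beta+1/2)}$, the exponent of the minimal diameter $n^{-\beta/(2\beta+1/2)}$, and the exponent of the diameter attained on a submodel $S(\gamma, M)$.

```python
from fractions import Fraction


def cutoff_exponent(beta):
    """Exponent a in the cut-off k ~ n^a, with a = 1 / (2*beta + 1/2)."""
    return 1 / (2 * Fraction(beta) + Fraction(1, 2))


def minimal_diameter_exponent(beta):
    """Exponent b in the minimal diameter n^(-b), b = beta / (2*beta + 1/2)."""
    beta = Fraction(beta)
    return beta / (2 * beta + Fraction(1, 2))


def diameter_exponent(beta, gamma):
    """Exponent c in the diameter n^(-c) on S(gamma, M) inside S(beta, L).

    For beta <= gamma <= 2*beta the adaptive estimation rate
    gamma / (2*gamma + 1) dominates; for larger gamma the minimal
    diameter fixed by the supermodel takes over.
    """
    beta, gamma = Fraction(beta), Fraction(gamma)
    if gamma < beta:
        raise ValueError("S(gamma, M) is a submodel only for gamma >= beta")
    if gamma <= 2 * beta:
        return gamma / (2 * gamma + 1)
    return minimal_diameter_exponent(beta)
```

For instance, `diameter_exponent(Fraction(1, 2), Fraction(3, 4))` gives the exponent for $\beta = 1/2$ and $\gamma = 3/4$, and the two formulas agree at the boundary $\gamma = 2\beta$.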


3. Sequence models. Suppose that we observe an infinite sequence
$X = (X_1, X_2, \ldots)$ of independent random variables $X_i$ possessing means $EX_i = \theta_i$
and variances $\sigma^2/n$. The parameter $\theta = (\theta_1, \theta_2, \ldots)$ is known to belong to a
subset $\Theta$ of $\ell_2$. This formulation encompasses both the finite and the infinite
sequence models of Examples 1.2 and 1.3, if in the former case it is understood
that $\Theta \subset \mathbb{R}^n := \{\theta \in \ell_2: \theta_i = 0, i > n\}$ and that $X_{n+1}, X_{n+2}, \ldots$ may not be used
to estimate $\sigma^2$. Our main interest is in the case where the $X_i$ are also normally
distributed, but we also consider the more general situation. The assumption of
normality allows a precise and simple derivation of the radius of a confidence
region. In a final subsection we also indicate how to obtain confidence sets with
guaranteed level for finite n.




3.1. Normal distributions. In this section we assume in addition to the
preceding that each $X_i$ is normally distributed.

Given an initial estimator $\hat\theta^{(n)}$, based on observations that are independent of $X$,
our estimator for $\|\theta - \hat\theta^{(n)}\|^2$ is given by


(3.1)    $R_{k,n}(\hat\theta^{(n)}) = \sum_{i=1}^k (X_i - \hat\theta_i^{(n)})^2 - \dfrac{k\sigma^2}{n}.$


Here $k = k_n$ is chosen dependent on $\Theta$ and/or $\Theta_1$, where we must have $k \le n$ in the
finite sequence model. This estimator is combined with the estimator of variance
(random only in its dependence on $\hat\theta^{(n)}$)


(3.2)    $\hat\tau^2_{k,n,\theta} = \dfrac{2k\sigma^4}{n^2} + \dfrac{4\sigma^2}{n} \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2.$


We shall show that $R_{k,n}(\hat\theta^{(n)})$ tends in distribution to a normal distribution,
uniformly in $\theta \in \ell_2$. This allows us to construct confidence sets of the type (2.2)
by using normal quantiles for the values $z_\alpha$. [Because $R_{k,n}$ and $\hat\tau_{k,n,\theta}$ depend in
fact only on $(\theta_1, \ldots, \theta_k)$, "uniformly in $\theta \in \ell_2$" means effectively "uniformly in
$(\theta_1, \ldots, \theta_k) \in \mathbb{R}^k$."] Because $R_{k,n}(\hat\theta^{(n)})$ is a sum of independent variables, its
asymptotic normality is not a surprise. The main contribution of the following
theorem is that this asymptotic normality is uniform in $\theta$, without any conditions on the
initial estimators $\hat\theta^{(n)}$. This depends crucially on the normality of the observations.

The convergence in the following theorem may be understood in the almost
sure sense. As the proof shows, the weak convergence is actually uniform in the
values $\hat\theta^{(n)}$.
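
The quantities in (3.1) and (3.2) and the resulting membership test for a set of type (2.2) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the vectors are taken to have exactly k coordinates (so the bias term is zero, as when $\Theta = \mathbb{R}^n$ and $k = n$), and $\sigma$ is treated as known.

```python
import math


def squared_distance_estimate(x, theta_hat, sigma, n, k):
    """R_{k,n}: sum over i <= k of (X_i - theta_hat_i)^2, minus k*sigma^2/n."""
    total = sum((x[i] - theta_hat[i]) ** 2 for i in range(k))
    return total - k * sigma ** 2 / n


def variance_scale_squared(theta, theta_hat, sigma, n, k):
    """tau^2_{k,n,theta}: 2*k*sigma^4/n^2 + (4*sigma^2/n) * squared distance."""
    dist2 = sum((theta[i] - theta_hat[i]) ** 2 for i in range(k))
    return 2 * k * sigma ** 4 / n ** 2 + 4 * sigma ** 2 * dist2 / n


def in_confidence_set(theta, x, theta_hat, sigma, n, k, z_alpha):
    """Check whether a candidate theta lies in a set of the type (2.2)."""
    r = squared_distance_estimate(x, theta_hat, sigma, n, k)
    tau = math.sqrt(variance_scale_squared(theta, theta_hat, sigma, n, k))
    # The square root of a negative number is defined to be 0.
    radius = math.sqrt(max(z_alpha * tau + r, 0.0))
    dist = math.sqrt(sum((t - h) ** 2 for t, h in zip(theta, theta_hat)))
    return dist <= radius
```

Because the scale depends on the candidate $\theta$, the set is described through a membership test rather than through a single radius, which reflects the remark that the region is in general not a ball.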


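THEOREM 3.1. For any $k_n \to \infty$ as $n \to \infty$,


    $\sup_{\theta\in\ell_2} \sup_{x\in\mathbb{R}} \Bigl| P_\theta\Bigl(\dfrac{R_{k_n,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k_n} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k_n,n,\theta}} \le x \Bigm| \hat\theta^{(n)}\Bigr) - \Phi(x) \Bigr| \to 0.$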





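PROOF. We can express the variable
$(R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2)/\hat\tau_{k,n,\theta}$
in the independent standard normal variables $\varepsilon_i$ defined by $X_i = \theta_i + (\sigma/\sqrt{n})\varepsilon_i$
as


    $\dfrac{(\sigma^2/n)\sum_{i=1}^k (\varepsilon_i^2 - 1) + (2\sigma/\sqrt{n})\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})\varepsilon_i}{\hat\tau_{k,n,\theta}}$

    $= \dfrac{1}{\sqrt{2k}} \sum_{i=1}^k (\varepsilon_i^2 - 1)\, A_{n,k}(\theta) + \dfrac{\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})\varepsilon_i}{\bigl(\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2\bigr)^{1/2}}\, B_{n,k}(\theta),$


for the positive constants whose squares are given by

    $A_{n,k}(\theta)^2 = \dfrac{1}{1 + (2n/(k\sigma^2)) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2},$

    $B_{n,k}(\theta)^2 = \dfrac{\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2}{(k\sigma^2/(2n)) + \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2}.$

By the rotational invariance of the multivariate standard normal distribution, for
any vector $v \in \mathbb{R}^k$ with norm 1 the random vector
$((2k)^{-1/2} \sum_{i=1}^k (\varepsilon_i^2 - 1), \sum_{i=1}^k v_i \varepsilon_i)$
is equal in distribution to the random vector
$((2k)^{-1/2} \sum_{i=1}^k (\varepsilon_i^2 - 1), k^{-1/2} \sum_{i=1}^k \varepsilon_i)$.
The latter vector tends in distribution to a vector of two independent
standard normal variables, as $n \to \infty$. The coefficients $A_{n,k}(\theta)$ and $B_{n,k}(\theta)$
are contained in the unit interval and satisfy $A_{n,k}(\theta)^2 + B_{n,k}(\theta)^2 = 1$, for any $k, n, \theta$.

We can complete the proof by noting that if a sequence of random vectors
$(X_n, Y_n)$ converges in distribution to a random vector $(X, Y)$, then the sequence
$AX_n + BY_n$ tends in distribution to $AX + BY$, uniformly in coefficients $(A, B)$
belonging to a compact set. []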


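The theorem shows that $R_{k,n}(\hat\theta^{(n)})$ is a good estimator of the squared norm
of the projection $\Pi_k(\theta - \hat\theta^{(n)})$ of $\theta - \hat\theta^{(n)}$ onto the k-dimensional subspace
$\{\theta \in \ell_2: \theta_i = 0, i > k\}$, and justifies (2.8) with $\hat\tau_{k,n,\theta}$ of the order as in (2.5). Thus
Proposition 2.1 yields a confidence region of diameter of the order


    $\sigma \Bigl(\dfrac{k_n}{n^2}\Bigr)^{1/4} + B_{k_n} + \|\theta - \hat\theta^{(n)}\|.$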
n 


EXAMPLE 3.1 (Finite sequence model). In the finite sequence model of
Example 1.2 with $\Theta = \mathbb{R}^n$, we have bias $B_k$ zero if we choose $k = n$. This leads to
confidence sets of diameter of the order equal to the maximum of $n^{-1/4}$ and the
estimation error $\|\theta - \hat\theta^{(n)}\|$.

As was shown by Li [32] and Baraud [1] the $n^{-1/4}$ lower bound cannot be
improved upon without losing full honesty.

We can influence the term $\|\theta - \hat\theta^{(n)}\|$ by choosing our favorite estimators $\hat\theta^{(n)}$.
For instance, we may choose any of the adaptive penalized minimum contrast
estimators considered in [8]. As shown by Birgé and Massart [8] we can adapt to large
classes of a priori models by choosing appropriate penalties.

One choice of penalties leads to estimators that, among other good properties, 
satisfy, for every D, 


: 2 2n 
sup E96 — el^ < |p +los($*"] + || 
0€8p n D 


where $\Theta_D = \{\theta \in \mathbb{R}^n: \#(\theta_i \ne 0) \le D\}$. The confidence sets centered at these
estimators attain a uniform order equal to the maximum of $n^{-1/4}$ and
$\sqrt{D/n} + \sqrt{\log(2n/D)/n}$. As long as $D \ll n$ this improves upon the order 1 rate attained
by the naive chi-square procedure, and we obtain the best possible rate $n^{-1/4}$
uniformly over every set $\Theta_D$ with $D \lesssim \sqrt{n}$. Thus these excellent adaptation properties
of $\hat\theta^{(n)}$ result in smaller confidence regions, for more submodels, than those found
in [1], pages 533-536, by a direct construction.

The estimators $R_{k,n}(\hat\theta^{(n)})$ and $\hat\tau_{k,n,\theta}$ in the preceding theorem depend on $\sigma^2$
and hence so far we have implicitly assumed that (an upper bound on) the variance
$\sigma^2$ is known. The preceding remains true if it is replaced by a good estimator.


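THEOREM 3.2. The assertion of Theorem 3.1 remains true if $\sigma^2$ in the
definitions (3.1) and (3.2) of $R_{k,n}$ and $\hat\tau_{k,n,\theta}$ is replaced by estimators $\hat\sigma^2$ such that


    $\sup_{\theta\in\Theta} P_\theta\bigl(\sqrt{k_n}\,|\hat\sigma^2 - \sigma^2| > \varepsilon \mid \hat\theta^{(n)}\bigr) \to 0.$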


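PROOF. Represent the observations as $X_i = \theta_i + (\sigma/\sqrt{n})\varepsilon_i$ for independent
standard normal variables $\varepsilon_i$. It suffices to prove the uniform asymptotic normality
of the variables


    $\dfrac{\sum_{i=1}^k (\varepsilon_i^2 - 1)\sigma^2/n + k(\sigma^2/n - \hat\sigma^2/n) + (2\sigma/\sqrt{n}) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})\varepsilon_i}{\sqrt{(2\hat\sigma^4 k/n^2) + (4\hat\sigma^2/n) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2}}.$

Therefore, it suffices to prove that, uniformly in $\theta \in \Theta$,

(3.3)    $\dfrac{2\hat\sigma^4/n^2 + (4\hat\sigma^2/(nk)) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2}{2\sigma^4/n^2 + (4\sigma^2/(nk)) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2} \to 1$ in probability,

(3.4)    $\dfrac{\sqrt{k}\,(\hat\sigma^2 - \sigma^2)/n}{\sqrt{2\sigma^4/n^2 + (4\sigma^2/(nk)) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2}} \to 0$ in probability.


The absolute value of the left-hand side of (3.3) minus 1 can be rewritten in the form, for
the constants $C_{n,k}(\theta) = (2n/(k\sigma^2)) \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2$,


    $\Bigl| \dfrac{\hat\sigma^2}{\sigma^2} \cdot \dfrac{\hat\sigma^2/\sigma^2 + C_{n,k}(\theta)}{1 + C_{n,k}(\theta)} - 1 \Bigr| \le \Bigl| \dfrac{\hat\sigma^2}{\sigma^2} - 1 \Bigr| \Bigl( \dfrac{\hat\sigma^2}{\sigma^2} + 1 \Bigr).$


Thus this reduces to $\hat\sigma/\sigma \to 1$ in probability, uniformly in $\theta$. Assertion (3.4) is true as soon as
$\sqrt{k}(\hat\sigma^2 - \sigma^2) \to 0$ in probability, uniformly in $\theta$. []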


In the finite sequence model with $\Theta = \mathbb{R}^n$ there is no possibility to estimate $\sigma^2$,
and the same is true in the infinite sequence model without some restriction on
the parameter set $\Theta$. On the other hand, in the infinite sequence model with a
restriction to regular parameters, estimation of $\sigma^2$ is easy.




EXAMPLE 3.2 (Regular models). For given integers $l$ and $m$ consider the
estimator $\hat\sigma^2 = (n/l) \sum_{i=m+1}^{m+l} X_i^2$. This has mean and variance given by


    $E_\theta \hat\sigma^2 = \sigma^2 + \dfrac{n}{l} \sum_{i=m+1}^{m+l} \theta_i^2,$

    $\mathrm{var}_\theta\, \hat\sigma^2 = \dfrac{n^2}{l^2} \sum_{i=m+1}^{m+l} \Bigl( \dfrac{4\sigma^2\theta_i^2}{n} + \dfrac{2\sigma^4}{n^2} \Bigr) = \dfrac{2\sigma^4}{l} + \dfrac{4\sigma^2 n}{l^2} \sum_{i=m+1}^{m+l} \theta_i^2.$


It follows that the mean squared error over the regularity class $S(\beta, L)$ can be
bounded as

    $E_\theta(\hat\sigma^2 - \sigma^2)^2 \le \dfrac{n^2 L^4}{l^2 m^{4\beta}} + \dfrac{2\sigma^4}{l} + \dfrac{4\sigma^2 n L^2}{l^2 m^{2\beta}}.$

In view of Theorem 3.2 we wish this to be of smaller order than $1/k$.

In the infinite sequence model with $\Theta = S(\beta, L)$ as the biggest model, we
choose $k = n^{1/(2\beta+1/2)}$ (cf. Example 2.2) and hence we must choose
$l \gg n^{1/(2\beta+1/2)}$. These values are shown for some values of $\beta$ in Table 1. For the
minimal value of $l$ we must choose $m \ge n^{1/(2\beta+1/2)}$ and then the estimator for $\sigma^2$
becomes independent of $R_{k,n}(\hat\theta^{(n)})$. A variety of other combinations of $(m, l)$ will
do as well.

In the finite sequence model with $\Theta$ restricted to $S(\beta, L)$, truncated to $\mathbb{R}^n$, the
choice $l \gg n^{1/(2\beta+1/2)}$ can be realized with $l \le n$ only if $\beta > 1/4$. We can then
combine it with $m$ of the order $n^{1/(2\beta+1/2)}$.
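
A small numerical sketch of this variance estimator follows. It is an illustration only: the helper names `simulate_sequence` and `estimate_sigma2` are hypothetical, normal errors are assumed, and Python lists are indexed from 0, so the coordinates $X_{m+1}, \ldots, X_{m+l}$ correspond to the slice `x[m:m + l]`.

```python
import math
import random


def simulate_sequence(theta, sigma, n, rng):
    """Draw X_i = theta_i + (sigma / sqrt(n)) * eps_i with standard normal eps_i."""
    scale = sigma / math.sqrt(n)
    return [t + scale * rng.gauss(0.0, 1.0) for t in theta]


def estimate_sigma2(x, n, m, l):
    """Estimate sigma^2 by (n / l) times the sum of squares of X_{m+1..m+l}."""
    block = x[m:m + l]
    if len(block) != l:
        raise ValueError("need at least m + l observed coordinates")
    return (n / l) * sum(xi * xi for xi in block)
```

The estimate is biased upward by $(n/l)\sum \theta_i^2$ over the block, which is why the block is taken far out in the sequence, where the coefficients of a regular parameter are small.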


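3.2. Nonnormal distributions. The assumed normality of the observations
$X_1, X_2, \ldots$ in the preceding section helps one to obtain precise critical values, but
it is not important for the general ideas. In this section let $X_i = \theta_i + (\sigma/\sqrt{n})\varepsilon_i$ for
an i.i.d. sequence $\varepsilon_1, \varepsilon_2, \ldots$ with mean zero, variance 1 and finite fourth moment.
Then define $R_{k,n}(\hat\theta^{(n)})$ as in (3.1) and define the variance estimator


    $\hat\tau^2_{k,n,\theta} = \dfrac{k\sigma^4\, \mathrm{var}(\varepsilon_1^2)}{n^2} + \dfrac{4\sigma^2}{n} \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2 + \dfrac{4\sigma^3}{n^{3/2}} \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})\, \mathrm{cov}(\varepsilon_1^2, \varepsilon_1).$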

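THEOREM 3.3. For any k and n,


    $\inf_{\theta\in\ell_2} P_\theta\Bigl( \dfrac{\bigl|R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2\bigr|}{\hat\tau_{k,n,\theta}} \le \dfrac{1}{\sqrt{\alpha}} \Bigm| \hat\theta^{(n)} \Bigr) \ge 1 - \alpha.$


PROOF. The quantity $R_{k,n}(\hat\theta^{(n)})$ is an unbiased estimator of $\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2$,
and $\hat\tau^2_{k,n,\theta}$ is equal to its variance. Therefore, the inequality follows by Chebyshev's
inequality. []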




The preceding theorem is based on Chebyshev's inequality, which is notably
imprecise. However, this crude device costs only in terms of the constants and
not in terms of the rate. If $Z$ is exactly standard normal distributed, then we
have that $P(|Z| \ge 1.96) = 0.05$, whereas the use of Chebyshev's inequality
$P(|Z| \ge M) \le M^{-2}$ would replace the normal quantile 1.96 by $M = 0.05^{-1/2} \approx 4.5$,
so that the resulting confidence set would be a bit more than two times too
wide.
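
The comparison between the two critical values is easy to reproduce. The sketch below uses hypothetical function names and assumes only the Python standard library; it evaluates the two-sided normal quantile and the Chebyshev value $M = \alpha^{-1/2}$ for a given level.

```python
from statistics import NormalDist


def normal_two_sided_quantile(alpha):
    """Value z with P(|Z| >= z) = alpha for a standard normal Z."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def chebyshev_critical_value(alpha):
    """Value M with M^(-2) = alpha, the bound given by Chebyshev's inequality."""
    return alpha ** -0.5
```

The ratio of the two values at a given level is the factor by which the Chebyshev-based set is wider than the one based on normal quantiles.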

For many estimators $\hat\theta^{(n)}$ we can avoid this penalty, because the quantities
$R_{k,n}(\hat\theta^{(n)})$ will be asymptotically normal, at least under the overall probability law
governing the initial estimators $\hat\theta^{(n)}$ and the observations $X^{(n)}$. This will depend on
the initial estimators $\hat\theta^{(n)}$, but the following assumption appears to be reasonable.
Assume that the initial estimators satisfy, for some sequence $\varepsilon_n \to 0$,


(3.5)    $\sup_{\theta\in\Theta} P_\theta\Bigl( \max_{1\le i\le k_n} |\hat\theta_i^{(n)} - \theta_i|^2 \ge \varepsilon_n \sum_{i=1}^{k_n} (\hat\theta_i^{(n)} - \theta_i)^2 \Bigr) \to 0.$

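THEOREM 3.4. For any $k_n \to \infty$ as $n \to \infty$ such that (3.5) holds,


    $\sup_{\theta\in\ell_2} \sup_{x\in\mathbb{R}} \Bigl| P_\theta\Bigl( \dfrac{R_{k_n,n}(\hat\theta^{(n)}) - \sum_{i=1}^{k_n} (\theta_i - \hat\theta_i^{(n)})^2}{\hat\tau_{k_n,n,\theta}} \le x \Bigr) - \Phi(x) \Bigr| \to 0.$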


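PROOF. We can express the variable
$(R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2)/\hat\tau_{k,n,\theta}$
in the form


(3.6)    $\sum_{i=1}^k A_{k,n}(\theta)(\varepsilon_i^2 - 1) + \sum_{i=1}^k B_{i,k,n}(\theta)\varepsilon_i,$

for the constants given by

    $A_{k,n}(\theta) = \dfrac{\sigma^2}{n \hat\tau_{k,n,\theta}},$

    $B_{i,k,n}(\theta) = \dfrac{2\sigma(\theta_i - \hat\theta_i^{(n)})}{\sqrt{n}\, \hat\tau_{k,n,\theta}}.$


The terms in the sum (3.6) are conditionally independent under $P_\theta$ given $\hat\theta^{(n)}$,
and the sum has conditional mean and variance equal to 0 and 1. If the terms of
the sum also satisfy the conditional Lindeberg condition in probability, then the
variables (3.6) converge conditionally in distribution, in probability. We wish to
show that this is true uniformly in $\theta \in \Theta$.

Thus it suffices to prove that for every $k_n \to \infty$, every $\delta > 0$ and any sequence
$\{\theta_n\} \subset \Theta$, as $n \to \infty$,


    $\sum_{i=1}^{k_n} E_{\theta_n} \bigl(A_{k_n,n}(\theta_n)(\varepsilon_i^2 - 1) + B_{i,k_n,n}(\theta_n)\varepsilon_i\bigr)^2 1\bigl\{|A_{k_n,n}(\theta_n)(\varepsilon_i^2 - 1) + B_{i,k_n,n}(\theta_n)\varepsilon_i| > \delta\bigr\} \to 0.$


For any $c \in [0, 1)$ and positive numbers $A$, $B$ we have that $(1 - c)(A^2 + B^2) \le
A^2 + B^2 - 2cAB$. Because the correlation $c$ between $\varepsilon_1^2$ and $\varepsilon_1$ is nonnegative and
strictly smaller than 1, this inequality can be used to see that


    $(1 - c)\Bigl( \dfrac{k\sigma^4\, \mathrm{var}(\varepsilon_1^2)}{n^2} + \dfrac{4\sigma^2}{n} \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2 \Bigr) \le \hat\tau^2_{k,n,\theta}.$

Consequently,


    $\max_{1\le i\le k} \bigl(A^2_{k,n}(\theta) + B^2_{i,k,n}(\theta)\bigr) \lesssim \dfrac{1}{k} + \dfrac{\max_{1\le i\le k} (\theta_i - \hat\theta_i^{(n)})^2}{\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2}.$

By assumption the right-hand side converges to zero in probability, as $k = k_n \to \infty$
and $n \to \infty$. We also have that $\sum_{i=1}^k (A^2_{k,n}(\theta) + B^2_{i,k,n}(\theta))$ is uniformly bounded.
We can conclude that the Lindeberg condition is satisfied. []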


3.3. Exact simulation. The procedures in the preceding section can be implemented
as soon as the lower-order moments of the errors $\varepsilon_i$ are known (or can
be estimated). If the full distribution of the errors is available, then we may also
obtain exact, finite-sample confidence regions. This observation is even of interest
in the case of Gaussian errors.

The variable $(R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k(\theta_i-\hat\theta_i^{(n)})^2)/\hat\tau_{k,n,\theta}$ can be written as a function
$S_{k,n}(\varepsilon_1, \ldots, \varepsilon_n, \theta, \hat\theta^{(n)})$, as in (3.6) in the proof of Theorem 3.4. This representation
allows simulation of the distribution of the given variable under $\theta$, for every
fixed $\theta \in \Theta$. Thus in principle we can find the $\alpha$-quantile $-z_\alpha(\theta)$ of this distribution,
for every $\theta$. Then $C_n$ given in Proposition 2.1, but with $z_\alpha$ replaced by $z_\alpha(\theta)$,
is a valid $(1 - \alpha)$-confidence region.

Under the conditions of Theorem 3.4 the quantiles $z_\alpha(\theta)$ converge to Gaussian
quantiles, uniformly in $\theta$.
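
As a hedged illustration of this simulation idea, the sketch below approximates $z_\alpha(\theta)$ by Monte Carlo. It assumes that the standardized variable is a sum of terms $\sigma^2(\varepsilon_i^2-1)/n + 2\sigma(\theta_i-\hat\theta_i)\varepsilon_i/\sqrt{n}$, with i.i.d. errors of mean 0 and variance 1, divided by its exact standard deviation. This form is an assumption chosen to be consistent with the representation (3.6); the function name, its arguments and the Gaussian default moments are hypothetical and not part of the paper.

```python
import numpy as np

def simulate_z_alpha(theta, theta_hat, sigma, n, alpha,
                     draw_errors=None, var_eps_sq=2.0, third_moment=0.0,
                     n_sim=20000, seed=0):
    """Monte Carlo approximation of z_alpha(theta) (hypothetical helper).

    Assumed form of the standardized variable:
        sum_i [sigma^2 (eps_i^2 - 1)/n + 2 sigma d_i eps_i / sqrt(n)] / tau,
    with d_i = theta_i - theta_hat_i and eps_i i.i.d., mean 0, variance 1.
    var_eps_sq = var(eps^2) and third_moment = E eps^3 (defaults: Gaussian).
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(theta, dtype=float) - np.asarray(theta_hat, dtype=float)
    k = d.size
    if draw_errors is None:
        # Default assumption: standard normal errors.
        draw_errors = lambda gen, shape: gen.standard_normal(shape)
    eps = draw_errors(rng, (n_sim, k))
    # One simulated value of the centered statistic per row.
    stat = (sigma**2 * (eps**2 - 1.0) / n
            + 2.0 * sigma * d * eps / np.sqrt(n)).sum(axis=1)
    # Exact variance of the centered statistic under the assumed form.
    tau_sq = (k * sigma**4 * var_eps_sq / n**2
              + 4.0 * sigma**2 * np.sum(d**2) / n
              + 4.0 * sigma**3 * third_moment * np.sum(d) / n**1.5)
    # The alpha-quantile of the standardized variable is -z_alpha(theta).
    return -np.quantile(stat / np.sqrt(tau_sq), alpha)
```

When the quadratic part dominates, the simulated quantile reflects the skewness of a centered chi-square variable; when the linear part dominates, it is close to the Gaussian quantile.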


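4. Density estimation. Suppose that we observe an i.i.d. sample $X_1, \ldots, X_n$
from a density $f$ relative to some measure $\mu$ on a measurable space $(\mathcal{X}, \mathcal{A})$. Let
$\theta = (\theta_1, \theta_2, \ldots)$ be the Fourier coefficients of $f$ relative to a given orthonormal basis
of $L_2(\mathcal{X}, \mathcal{A}, \mu)$, and let $\Theta$ correspond to the collection of all densities deemed
possible. Assume that the densities $\theta \in \Theta$ are uniformly bounded.

Given an initial estimator $\hat\theta^{(n)}$ our estimator for $\|\theta - \hat\theta^{(n)}\|^2$ is given by

$$R_{k,n}(\hat\theta^{(n)}) = \frac{1}{n(n-1)} \mathop{\sum\sum}_{r\ne s}\ \sum_{i=1}^k \bigl(e_i(X_r) - \hat\theta_i^{(n)}\bigr)\bigl(e_i(X_s) - \hat\theta_i^{(n)}\bigr).$$

Here $k = k_n$ is chosen dependent on $\Theta$. We combine this with the variance estimator

$$\hat\tau^2_{k,n,\theta} = \frac{2k\|f\|_\infty^2}{n(n-1)} + \frac{4\|f\|_\infty}{n}\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2.$$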



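THEOREM 4.1. For any $k$, $n$,

$$\sup_{\theta\in\Theta} E_\theta\Biggl(\biggl(\frac{R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k(\theta_i-\hat\theta_i^{(n)})^2}{\hat\tau_{k,n,\theta}}\biggr)^2 \Bigm| \hat\theta^{(n)}\Biggr) \le 1.$$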


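PROOF. The estimator $R_{k,n}(\hat\theta^{(n)})$ is a U-statistic of order 2 with kernel
$h(x, y) = \sum_{i=1}^k (e_i(x) - \hat\theta_i^{(n)})(e_i(y) - \hat\theta_i^{(n)})$. Its mean is equal to

$$E h(X_1, X_2) = \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2.$$

Its Hoeffding decomposition (e.g., [39], Section 11.4) is

(4.1)  $$R_{k,n}(\hat\theta^{(n)}) = E h(X_1, X_2) + \frac{1}{n}\sum_{r=1}^n P_1 h(X_r) + \frac{1}{n(n-1)} \mathop{\sum\sum}_{r\ne s} P_{1,2} h(X_r, X_s),$$

for the "kernel functions" given by

$$P_1 h(x) = 2\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})(e_i(x) - \theta_i),$$

$$P_{1,2} h(x, y) = \sum_{i=1}^k (e_i(x) - \theta_i)(e_i(y) - \theta_i)
= \sum_{i=1}^k e_i(x) e_i(y) - \sum_{i=1}^k \theta_i\bigl(e_i(x) + e_i(y)\bigr) + \sum_{i=1}^k \theta_i^2.$$

The three terms of the Hoeffding decomposition and also each of the individual
terms in its sums are uncorrelated. Furthermore, the variance of the last term
in (4.1) is equal to $2/(n(n - 1)) \operatorname{var} P_{1,2} h(X_1, X_2)$.

The variance of a factor in the linear term can be bounded as

$$\operatorname{var}\bigl(P_1 h(X_1)\bigr) = 4\operatorname{var}\Bigl(\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)}) e_i(X_1)\Bigr)
\le 4\|f\|_\infty \int \Bigl(\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)}) e_i\Bigr)^2 d\mu = 4\|f\|_\infty \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2,$$

by the orthonormality of the functions $e_i$ in $L_2(\mu)$.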


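The variables $\sum_{i=1}^k (e_i(X_1) - \theta_i)(e_i(X_2) - \theta_i)$ and $\sum_{i=1}^k \theta_i (e_i(X_1) + e_i(X_2))$
are uncorrelated and their sum is $\sum_{i=1}^k e_i(X_1) e_i(X_2) + \sum_{i=1}^k \theta_i^2$. It follows that

$$\operatorname{var}\bigl(P_{1,2} h(X_1, X_2)\bigr) = \operatorname{var}\sum_{i=1}^k e_i(X_1) e_i(X_2) - \operatorname{var}\sum_{i=1}^k \theta_i\bigl(e_i(X_1) + e_i(X_2)\bigr).$$

This becomes bigger if we leave out the second variance on the right and replace
the first variance on the right by the second moment $E(\sum_{i=1}^k e_i(X_1) e_i(X_2))^2$,
which can be bounded by

$$\|f\|_\infty^2 \int\!\!\int \Bigl(\sum_{i=1}^k e_i(x) e_i(y)\Bigr)^2 d\mu(x)\, d\mu(y) = k\|f\|_\infty^2,$$

by the orthonormality of the functions $e_i$ in $L_2(\mu)$. □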
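
The estimator and the variance bound of this section can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the sample lives on $[0, 1]$ with Lebesgue measure, the orthonormal basis is the cosine basis, and the names `cosine_basis`, `u_statistic` and `variance_bound` are hypothetical. The double sum over $r \ne s$ is computed through the identity $\sum_{r\ne s} a_r a_s = (\sum_r a_r)^2 - \sum_r a_r^2$, applied coordinate by coordinate.

```python
import numpy as np

def cosine_basis(x, k):
    """First k cosine basis functions on [0, 1] (assumed basis), shape (n, k)."""
    x = np.asarray(x, dtype=float)
    basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(k)))
    basis[:, 0] = 1.0  # the constant function has norm 1 without the sqrt(2)
    return basis

def u_statistic(x, theta_hat):
    """U-statistic of order 2 with kernel sum_i (e_i(x)-th_i)(e_i(y)-th_i)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    a = cosine_basis(x, theta_hat.size) - theta_hat  # a[r, i] = e_i(X_r) - th_i
    n = a.shape[0]
    col_sums = a.sum(axis=0)
    # sum over r != s of a[r, i] * a[s, i], then sum over i
    off_diagonal = col_sums**2 - (a**2).sum(axis=0)
    return float(off_diagonal.sum() / (n * (n - 1)))

def variance_bound(k, n, sup_f, dist_sq):
    """2k|f|^2/(n(n-1)) + 4|f| dist_sq / n, with dist_sq = sum_i (theta_i - th_i)^2."""
    return 2.0 * k * sup_f**2 / (n * (n - 1)) + 4.0 * sup_f * dist_sq / n
```

The vectorized form costs $O(nk)$ operations rather than the $O(n^2 k)$ of the literal double sum.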


By Markov's inequality, if we choose $z_\alpha = \sqrt{1/\alpha}$, then

$$\inf_{\theta\in\Theta} P_\theta\Bigl(\Bigl|R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2\Bigr| \le z_\alpha \hat\tau_{k,n,\theta} \Bigm| \hat\theta^{(n)}\Bigr) \ge 1 - \alpha.$$

The present variance $\hat\tau^2_{k,n,\theta}$ has exactly the same form as in Section 3, with $\|f\|_\infty$
playing the role of $\sigma^2$. For more precision we can express $\|f\|_\infty$ in $\theta$, and it is not
necessary to know a uniform bound on the regression functions. The approximation
(2.8) with $\hat\tau^2_{k,n,\theta}$ of the order as in (2.5) is again satisfied and Proposition 2.1
yields a confidence region of diameter of the order, with $M$ a uniform bound on $\Theta$,


ku , 
MC) + By, + | P — 0€]. 


The corollaries for, for example, regular models are the same. 

Depending on the basis functions $e_i$, the resulting confidence region can be
tightened by using higher moments or exponential bounds. Finding an exact limit
distribution appears to be not straightforward. Existing limit results for U-statistics
with changing kernels (e.g., [33]) are based on approximation of the kernel by a
finite product kernel of fixed dimension. In our case the kernel is already in product 
form, but the increase in its dimension k is essential. 


5. Random regression. Suppose that we observe an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$
from the distribution of a vector $(X, Y)$ described structurally as
$Y = f(X) + \varepsilon$, for $(X, \varepsilon)$ a random vector with $E(\varepsilon \mid X) = 0$ and $\sigma^2(x) = E(\varepsilon^2 \mid X = x)$
admitting a bounded version. The distribution $P_X$ of $X$ is known and $\theta_1, \theta_2, \ldots$ are
the Fourier coefficients of the regression function $f$ relative to a given orthonormal
basis $e_1, e_2, \ldots$ of $L_2(P_X)$. We assume that the set of regression functions is
uniformly bounded.




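Given an initial estimator $\hat\theta^{(n)}$ our estimator for $\|\theta - \hat\theta^{(n)}\|^2$ is given by

$$R_{k,n}(\hat\theta^{(n)}) = \frac{1}{n(n-1)} \mathop{\sum\sum}_{r\ne s}\ \sum_{i=1}^k \bigl(Y_r e_i(X_r) - \hat\theta_i^{(n)}\bigr)\bigl(Y_s e_i(X_s) - \hat\theta_i^{(n)}\bigr).$$

Here $k = k_n$ is chosen dependent on $\Theta$. We combine this with the variance estimator

$$\hat\tau^2_{k,n,\theta} = \frac{2k(\|f\|_\infty^2 + \|\sigma^2\|_\infty)^2}{n(n-1)} + \frac{4(\|f\|_\infty^2 + \|\sigma^2\|_\infty)}{n}\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2.$$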


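THEOREM 5.1. For any $k$, $n$,

$$\sup_{\theta\in\Theta} E_\theta\Biggl(\biggl(\frac{R_{k,n}(\hat\theta^{(n)}) - \sum_{i=1}^k(\theta_i-\hat\theta_i^{(n)})^2}{\hat\tau_{k,n,\theta}}\biggr)^2 \Bigm| \hat\theta^{(n)}\Biggr) \le 1.$$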


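PROOF. The proof is similar to the proof of Theorem 4.1. The variable
$R_{k,n}(\hat\theta^{(n)})$ is again a U-statistic of order 2. It has mean $\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2$ and
Hoeffding decomposition [cf. (4.1), but replace $X_r$ by $(X_r, Y_r)$] with kernels of the
form

$$P_1 h(x, y) = 2\sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})\bigl(y e_i(x) - \theta_i\bigr),$$

$$P_{1,2} h(x_1, y_1, x_2, y_2) = \sum_{i=1}^k \bigl(y_1 e_i(x_1) - \theta_i\bigr)\bigl(y_2 e_i(x_2) - \theta_i\bigr)
= \sum_{i=1}^k y_1 y_2 e_i(x_1) e_i(x_2) - \sum_{i=1}^k \theta_i\bigl(y_1 e_i(x_1) + y_2 e_i(x_2)\bigr) + \sum_{i=1}^k \theta_i^2.$$

By the orthonormality of the functions $e_i$ and arguments as in the proof of Theorem 4.1,

$$\operatorname{var} P_1 h(X, Y) \le 4\|E(Y^2 \mid X)\|_\infty \sum_{i=1}^k (\theta_i - \hat\theta_i^{(n)})^2,$$

$$\operatorname{var} P_{1,2} h(X_1, Y_1, X_2, Y_2) \le \|E(Y^2 \mid X)\|_\infty^2\, k.$$

From $Y = f(X) + \varepsilon$ and $E(\varepsilon \mid X) = 0$ it follows that $E(Y^2 \mid X) = f^2(X) + E(\varepsilon^2 \mid X) \le \|f\|_\infty^2 + \|\sigma^2\|_\infty$.
Combining the preceding bounds we obtain the theorem. □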




The bound given by the preceding theorem is of the same form as the bounds
given in the preceding sections, but with $\|f\|_\infty^2 + \|\sigma^2\|_\infty$ playing the role of $\sigma^2$ in
Section 3. Again (2.8) is justified with $\hat\tau^2_{k,n,\theta}$ of the order as in (2.5). Proposition 2.1
gives the same corollaries for confidence regions.
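
For the regression setting the same computation applies with $Y_r e_i(X_r)$ in place of $e_i(X_r)$. The sketch below is illustrative only: it assumes that $P_X$ is the uniform distribution on $[0, 1]$, so that the cosine basis is orthonormal in $L_2(P_X)$, and the function names are hypothetical.

```python
import numpy as np

def cosine_basis(x, k):
    """First k cosine basis functions, orthonormal for X uniform on [0, 1]."""
    x = np.asarray(x, dtype=float)
    basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(x, np.arange(k)))
    basis[:, 0] = 1.0
    return basis

def regression_u_statistic(x, y, theta_hat):
    """U-statistic with kernel sum_i (y1 e_i(x1) - th_i)(y2 e_i(x2) - th_i)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    # a[r, i] = Y_r e_i(X_r) - th_i
    a = y[:, None] * cosine_basis(x, theta_hat.size) - theta_hat
    n = a.shape[0]
    off_diagonal = a.sum(axis=0)**2 - (a**2).sum(axis=0)
    return float(off_diagonal.sum() / (n * (n - 1)))

def regression_variance_bound(k, n, sup_f, sup_sigma_sq, dist_sq):
    """Variance bound with m = |f|_inf^2 + |sigma^2|_inf playing the role of sigma^2."""
    m = sup_f**2 + sup_sigma_sq
    return 2.0 * k * m**2 / (n * (n - 1)) + 4.0 * m * dist_sq / n
```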


6. Lower bounds. In this section we relate the minimum diameter of a confidence
region to the minimax rates for testing and estimation. Consider a sequence
of statistical experiments $(P_\theta^{(n)} : \theta \in \Theta)$ indexed by a parameter $\theta \in \Theta$ in a metric
space $(\Theta, d)$ and a submodel indexed by a subset $\Theta_1 \subset \Theta$. We are interested in the
maximal diameter over $\Theta_1$ of confidence regions that are honest over the whole
model $\Theta$.

We shall silently understand that appropriate measurability assumptions regarding
the confidence regions are satisfied.

Given $0 < \alpha < \beta < 1$, let $\varepsilon_n$ be a sequence of positive numbers such that there
exists no sequence of tests $\phi_n$ satisfying the two requirements, for some given
subsets $\Theta_{n,1} \subset \Theta_1$,

(6.1)  $$\limsup_{n\to\infty}\ \sup_{\theta\in\Theta:\, d(\theta,\Theta_{n,1}) \ge \varepsilon_n} P_\theta^{(n)} \phi_n \le \alpha,$$

(6.2)  $$\limsup_{n\to\infty}\ \sup_{\theta\in\Theta_{n,1}} P_\theta^{(n)}(1 - \phi_n) < \beta.$$

This can only be satisfied if $\alpha + \beta \le 1$, because otherwise the trivial test $\phi_n = \alpha'$
for some $\alpha'$ with $\alpha' < \alpha$ and $1 - \alpha' < \beta$ satisfies (6.1)-(6.2). For $\beta < 1 - \alpha < 1$,
the condition is satisfied for $\varepsilon_n$ equal to what Ingster [23] calls a rate of "not asymptotic
indistinguishability of the hypotheses." The following lemma shows that
the diameter over $\Theta_1$ of an honest confidence set is at least of the order $\varepsilon_n$.


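LEMMA 6.1. For given $0 < \alpha < \beta < 1$ and subsets $\Theta_{n,1} \subset \Theta_1$, if there exists
no sequence of tests $\phi_n$ satisfying (6.1)-(6.2), then for any sequence of confidence
sets $C_n$ satisfying (1.1),

$$\limsup_{n\to\infty}\ \sup_{\theta\in\Theta_1} P_\theta^{(n)}\bigl(\operatorname{diam}(C_n) \ge \varepsilon_n\bigr) \ge \beta - \alpha.$$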


PROOF. Let $\Theta_{n,0} = \{\theta \in \Theta : d(\theta, \Theta_{n,1}) \ge \varepsilon_n\}$. Given a sequence of confidence
sets $C_n$ satisfying (1.1) define tests by $\phi_n = 1_{d(C_n, \Theta_{n,0}) > 0}$.

If $\theta \in \Theta_{n,0}$ and $d(C_n, \Theta_{n,0}) > 0$, then $\theta \notin C_n$. Therefore, from (1.1) it is immediate
that these tests satisfy (6.1).

If $\theta \in \Theta_{n,1}$, $d(C_n, \Theta_{n,0}) = 0$ and $\theta \in C_n$, then $\operatorname{diam}(C_n) \ge \varepsilon_n$. [Indeed, for
every $\delta > 0$ there exist points $c \in C_n$ and $\theta_n \in \Theta_{n,0}$ with $d(c, \theta_n) < \delta$. By the definition
of $\Theta_{n,0}$ we have $d(\theta_n, \Theta_{n,1}) \ge \varepsilon_n$ and hence $d(\theta_n, \theta) \ge \varepsilon_n$. By the triangle
inequality $d(c, \theta) \ge \varepsilon_n - \delta$.] It follows that, for every $\theta \in \Theta_{n,1}$,

$$P_\theta^{(n)}(1 - \phi_n) = P_\theta^{(n)}\bigl(d(C_n, \Theta_{n,0}) = 0\bigr) \le P_\theta^{(n)}\bigl(\operatorname{diam}(C_n) \ge \varepsilon_n\bigr) + P_\theta^{(n)}(\theta \notin C_n).$$

By (1.1) the second term on the right-hand side is strictly asymptotically smaller
than $\alpha$, uniformly in $\theta \in \Theta$. If the first term on the right-hand side were asymptotically
smaller than $\beta - \alpha$, uniformly in $\theta \in \Theta_1$, thus contradicting the assertion of
the lemma, then the left-hand side would be asymptotically strictly less than $\beta$, so
that the tests would also satisfy (6.2). □


To obtain a lower bound for $\sup_{\theta\in\Theta_1} P_\theta^{(n)}(\operatorname{diam}(C_n) \ge \varepsilon_n)$ we can apply the
preceding lemma with $\Theta_{n,1} = \Theta_1$, but also with every subset of $\Theta_1$. In particular,
we may apply the lemma with a one-point set $\Theta_{n,1} = \{\theta_1\}$, for any $\theta_1 \in \Theta_1$. For
regularity models $\Theta$, Ingster [23] characterizes the minimax rate for exactly these
one-point problems. He shows that there exists a rate $\varepsilon_n^*$ such that the sum of
the error probabilities (6.1)-(6.2) goes to zero if $\varepsilon_n/\varepsilon_n^* \to \infty$ and goes to 1 if
$\varepsilon_n/\varepsilon_n^* \to 0$. Thus the condition of the lemma is satisfied for any $0 < \alpha < \beta < 1$
with $\alpha + \beta < 1$ and $\varepsilon_n$ with $\varepsilon_n/\varepsilon_n^* \to 0$. The lemma then says that the weak limit
points in $[0, \infty]$ of the distribution of $\operatorname{diam}(C_n)/\varepsilon_n^*$ have a component of size at
least $\beta - \alpha$ concentrated on $(0, \infty]$. In other words, the order of the diameter is at
least $\varepsilon_n^*$.

The relationship between the diameter of confidence regions and the minimax
rate for estimation is less perfect, due to the fact that the risk for estimation concerns
the complete distribution of an estimator, whereas a confidence region at
level $1 - \alpha$ leaves a mass of size $\alpha$ completely undiscussed.

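A key result is as follows. Let $\beta > 0$ be given, and let $\varepsilon_n$ be a sequence of
positive numbers such that for every estimator sequence $T_n$,

(6.3)  $$\liminf_{n\to\infty}\ \sup_{\theta\in\Theta_1} P_\theta^{(n)}\bigl(d(T_n, \theta) \ge \varepsilon_n\bigr) \ge \beta.$$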


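LEMMA 6.2. For given $0 < \alpha < \beta < 1$, if (6.3) holds for every estimator
sequence $T_n$, then for any sequence of confidence sets $C_n$ satisfying (1.1),

$$\liminf_{n\to\infty}\ \sup_{\theta\in\Theta_1} P_\theta^{(n)}\bigl(\operatorname{diam}(C_n) \ge \varepsilon_n\bigr) \ge \beta - \alpha.$$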


PROOF. Given a sequence of confidence sets $C_n$, define for each $n$ an estimator
$T_n$ to be an arbitrary point in $C_n$. Then, for any $\theta \in \Theta$,

$$P_\theta^{(n)}\bigl(d(T_n, \theta) \ge \varepsilon_n\bigr) \le P_\theta^{(n)}\bigl(\operatorname{diam}(C_n) \ge \varepsilon_n\bigr) + P_\theta^{(n)}(\theta \notin C_n).$$

By (1.1) the second term on the right-hand side is asymptotically smaller than $\alpha$,
uniformly in $\theta \in \Theta$. By assumption the liminf of the supremum of the left-hand
side over $\theta \in \Theta_1$ is bounded below by $\beta$. □
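
Spelled out, the last step of this proof is a rearrangement of the displayed inequality. The lines below only restate the argument; they assume that (1.1), which is stated earlier in the paper, yields $\limsup_n \sup_\theta P_\theta^{(n)}(\theta \notin C_n) \le \alpha$, and that (6.3) bounds the liminf of the supremum over $\Theta_1$ from below by $\beta$.

```latex
\begin{align*}
\liminf_{n\to\infty}\sup_{\theta\in\Theta_1}
  P_\theta^{(n)}\bigl(\operatorname{diam}(C_n)\ge\varepsilon_n\bigr)
&\ge \liminf_{n\to\infty}\Bigl[\,\sup_{\theta\in\Theta_1}
  P_\theta^{(n)}\bigl(d(T_n,\theta)\ge\varepsilon_n\bigr)
  -\sup_{\theta\in\Theta_1}P_\theta^{(n)}(\theta\notin C_n)\Bigr]\\
&\ge \liminf_{n\to\infty}\sup_{\theta\in\Theta_1}
  P_\theta^{(n)}\bigl(d(T_n,\theta)\ge\varepsilon_n\bigr)
  -\limsup_{n\to\infty}\sup_{\theta\in\Theta_1}P_\theta^{(n)}(\theta\notin C_n)\\
&\ge \beta-\alpha .
\end{align*}
```

The first inequality uses $\sup(a + b) \le \sup a + \sup b$, and the second uses $\liminf(u_n - v_n) \ge \liminf u_n - \limsup v_n$.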




If we choose $\varepsilon_n$ faster than the minimax rate, then typically (6.3) holds for some
$\beta > 0$. In particular this is true if the minimax rate $\varepsilon_n^*$ has the property that for a
"best" estimator sequence $T_n$ the sequence $d(T_n, \theta)/\varepsilon_n^*$ has all its limit points on
$(0, \infty]$. In that case $d(T_n, \theta)/\varepsilon_n \to \infty$, and the left-hand side of (6.3) is 1, for any
sequence $\varepsilon_n$ with $\varepsilon_n/\varepsilon_n^* \to 0$. We may then apply the lemma with any $\beta < 1$. More
generally, this argument works if the weak limit points of the sequence $d(T_n, \theta)/\varepsilon_n^*$
in $[0, \infty]$ possess a point mass of at most $\beta$ at 0.


REFERENCES 


[1] BARAUD, Y. (2004). Confidence balls in Gaussian regression. Ann. Statist. 32 528-551. MR2060168
[2] BARRON, A., BIRGÉ, L. and MASSART, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413. MR1679028
[3] BARRON, A. R. and COVER, T. M. (1991). Minimum complexity density estimation. IEEE Trans. Inform. Theory 37 1034-1054. MR1111806
[4] BERAN, R. (2000). REACT scatterplot smoothers: Superefficiency through basis economy. J. Amer. Statist. Assoc. 95 155-171. MR1803148
[5] BERAN, R. and DÜMBGEN, L. (1998). Modulation of estimators and confidence sets. Ann. Statist. 26 1826-1856. MR1673280
[6] BICKEL, P. J. and RITOV, Y. (1988). Estimating integrated squared density derivatives: Sharp best order of convergence estimates. Sankhyā Ser. A 50 381-393. MR1065550
[7] BIRGÉ, L. (2002). Discussion of "Random rates in anisotropic regression," by M. Hoffmann and O. Lepski. Ann. Statist. 30 359-363. MR1902892
[8] BIRGÉ, L. and MASSART, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3 203-268. MR1848946
[9] BRETAGNOLLE, J. and HUBER, C. (1979). Estimation des densités: Risque minimax. Z. Wahrsch. Verw. Gebiete 47 119-137. MR0523165
[10] BROWN, L. D., CARTER, A. V., LOW, M. G. and ZHANG, C.-H. (2004). Equivalence theory for density estimation, Poisson processes and Gaussian white noise with drift. Ann. Statist. 32 2074-2097. MR2102503
[11] CAI, T. and LOW, M. (2006). Adaptive confidence balls. Ann. Statist. 34 202-228.
[12] DONOHO, D. L. and JOHNSTONE, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200-1224. MR1379464
[13] DONOHO, D. L. and JOHNSTONE, I. M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81 425-455. MR1311089
[14] DONOHO, D. L., JOHNSTONE, I. M., KERKYACHARIAN, G. and PICARD, D. (1995). Wavelet shrinkage: Asymptopia? (with discussion). J. Roy. Statist. Soc. Ser. B 57 301-369. MR1323344
[15] DONOHO, D. L., JOHNSTONE, I. M., KERKYACHARIAN, G. and PICARD, D. (1996). Density estimation by wavelet thresholding. Ann. Statist. 24 508-539. MR1394974
[16] EFROMOVICH, S. YU. and PINSKER, M. S. (1984). Learning algorithm for nonparametric filtering. Autom. Remote Control 11 1434-1440.
[17] FAN, J. (1991). On the estimation of quadratic functionals. Ann. Statist. 19 1273-1294. MR1126325
[18] GENOVESE, C. R. and WASSERMAN, L. (2005). Confidence sets for nonparametric wavelet regression. Ann. Statist. 33 698-729. MR2163157
[19] GOLUBEV, G. K. (1987). Adaptive asymptotically minimax estimates of smooth signals. Problems Inform. Transmission 23 57-67. MR0893970
[20] HOFFMANN, M. and LEPSKI, O. (2002). Random rates in anisotropic regression (with discussion). Ann. Statist. 30 325-396. MR1902892
[21] IBRAGIMOV, I. A. and KHASMINSKII, R. Z. (1980). Asymptotic properties of some nonparametric estimates in Gaussian white noise. Proc. Third Summer School on Probab. Theory and Math. Statist. Varna 1978.
[22] IBRAGIMOV, I. A. and KHASMINSKII, R. Z. (1981). Statistical Estimation. Asymptotic Theory. Springer, Berlin. MR0620321
[23] INGSTER, YU. I. (1993). Asymptotically minimax hypothesis testing for nonparametric alternatives. I, II, III. Math. Methods Statist. 2 85-114, 171-189, 249-268. MR1257978, MR1257983, MR1259685
[24] INGSTER, YU. I. and SUSLINA, I. A. (2003). Nonparametric Goodness-of-Fit Testing Under Gaussian Models. Lecture Notes in Statist. 168. Springer, New York. MR1991446
[25] JUDITSKY, A. and LAMBERT-LACROIX, S. (2003). Nonparametric confidence set estimation. Math. Methods Statist. 12 410-428. MR2054156
[26] LAURENT, B. (1996). Efficient estimation of integral functionals of a density. Ann. Statist. 24 659-681. MR1394981
[27] LAURENT, B. (1997). Estimation of integral functionals of a density and its derivatives. Bernoulli 3 131-211. MR1466306
[28] LAURENT, B. and MASSART, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28 1302-1338. MR1805785
[29] LEPSKII, O. (1990). On a problem of adaptive estimation in Gaussian white noise. Theory Probab. Appl. 35 454-466. MR1091202
[30] LEPSKII, O. (1991). Asymptotically minimax adaptive estimation. I. Upper bounds. Optimally adaptive estimates. Theory Probab. Appl. 36 682-697. MR1147167
[31] LEPSKII, O. (1992). Asymptotically minimax adaptive estimation. II. Schemes without optimal adaptation. Adaptive estimates. Theory Probab. Appl. 37 433-448. MR1214353
[32] LI, K.-C. (1989). Honest confidence regions for nonparametric regression. Ann. Statist. 17 1001-1008. MR1015135
[33] MIKOSCH, T. (1993). A weak invariance principle for weighted U-statistics with varying kernels. J. Multivariate Anal. 47 82-102. MR1239107
[34] NUSSBAUM, M. (1985). Spline smoothing in regression models and asymptotic efficiency in $L_2$. Ann. Statist. 13 984-997. MR0803753
[35] NUSSBAUM, M. (1996). Asymptotic equivalence of density estimation and Gaussian white noise. Ann. Statist. 24 2399-2430. MR1425959
[36] PINSKER, M. (1980). Optimal filtering of square-integrable signals in Gaussian noise. Problems Inform. Transmission 16 120-133. MR0624591
[37] STONE, C. J. (1984). An asymptotically optimal window selection rule for kernel density estimates. Ann. Statist. 12 1285-1297. MR0760688
[38] TSYBAKOV, A. (2004). Introduction à l'estimation non-paramétrique. Springer, Berlin. MR2013911
[39] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press. MR1652247


DEPARTMENT OF EPIDEMIOLOGY
HARVARD SCHOOL OF PUBLIC HEALTH
677 HUNTINGTON AVENUE
BOSTON, MASSACHUSETTS 02115
USA
E-MAIL: robins@hsph.harvard.edu

DEPARTMENT OF MATHEMATICS
VRIJE UNIVERSITEIT AMSTERDAM
DE BOELELAAN 1081A
1081 HV AMSTERDAM
THE NETHERLANDS
E-MAIL: aad@cs.vu.nl


The Annals of Statistics
2006, Vol. 34, No. 1, 254-289
DOI: 10.1214/009053605000000769
© Institute of Mathematical Statistics, 2006


SERIAL AND NONSERIAL SIGN-AND-RANK STATISTICS: 
ASYMPTOTIC REPRESENTATION AND 
ASYMPTOTIC NORMALITY 


BY MARC HALLIN,! CATHERINE VERMANDELE AND BAS WERKER 


Université Libre de Bruxelles, Université Libre de Bruxelles 
and Tilburg University 


The classical theory of rank-based inference is entirely based either on 
ordinary ranks, which do not allow for considering location (intercept) pa- 
rameters, or on signed ranks, which require an assumption of symmetry. If 
the median, in the absence of a symmetry assumption, is considered as a location
parameter, the maximal invariance property of ordinary ranks is lost
to the ranks and the signs. This new maximal invariant thus suggests a new
class of statistics, based on ordinary ranks and signs. An asymptotic representation
theory à la Hájek is developed here for such statistics, both in the
nonserial and in the serial case. The corresponding asymptotic normality re- 
sults clearly show how the signs add a separate contribution to the asymptotic 
variance, hence, potentially, to asymptotic efficiency. As shown by Hallin and 
Werker [Bernoulli 9 (2003) 137-165], conditioning in an appropriate way on 
the maximal invariant potentially even leads to semiparametrically efficient 
inference. Applications to semiparametric inference in regression and time 
series models with median restrictions are treated in detail in an upcoming 
companion paper. 


1. Introduction. The classical theory of rank-based inference is entirely 
based either on ordinary ranks or on signed ranks. Ranks indeed are maximal in- 
variant with respect to the group of continuous order-preserving transformations, 
a group that generates the null hypothesis of absolutely continuous independent 
white noise (no location restriction), whereas signed ranks (i.e., the signs along
with the ranks of absolute values) are maximal invariant under the subgroup that 
generates the subhypothesis of symmetric (with respect to the origin) independent 
white noise. 

Now, in most statistical models a location parameter for the error term is usually 
specified to be zero: regression and analysis of variance models, stationary autore- 
gressive moving average (ARMA) models and so on. Symmetric white noise al- 
lows for such an identification, at the expense, however, of a symmetry assumption 


Received February 2003; revised August 2004.
1 Supported by an I.A.P. grant of the Belgian Federal Government and an Action de Recherche
Concertée from the Communauté française de Belgique.
AMS 2000 subject classifications. 62G10, 62M10.
Key words and phrases. Ranks, signs, Hájek representation, median regression, median restrictions,
maximal invariant.




that in practice is often quite unrealistic. In addition, the trouble with independent 
white noise without further restrictions is that it does not allow for identifying any 
location parameter. 

This location parameter in most applied work is the mean, a heritage of
Gaussian models, but could be the median as well. Zero-median noise is certainly
as natural as zero-mean noise. In a semiparametric context, it is even more
satisfactory, because it does not require any moment assumption on the densities
under consideration. Median regression and autoregression models have, therefore,
recently attracted much attention: see, for instance, [12, 14, 15, 17, 21], to
quote only a few. Moreover, from the point of view of statistical inference, the
assumption of zero-median noise is also more convenient, since it induces more 
structure. The hypothesis of zero-mean white noise indeed is not invariant under 
any nontrivial group of transformations, so group invariance arguments cannot be 
invoked in models that involve zero-mean noise. The situation is quite different for 
the hypothesis of zero-median noise, which is generated by the group of all contin- 
uous order-preserving transformations g such that g(0) = 0. A maximal invariant 
for this group is the vector of ordinary ranks, along with the vector of signs. Hallin 
and Werker [11] have shown that, in such a situation, semiparametric efficiency 
is achieved by conditioning with respect to a maximal invariant. Maximality of 
the invariant here is essential: conditioning, for example, on the ranks when the 
signs and ranks, not the ranks alone, are maximal invariant generally induces an 
avoidable loss of efficiency. 

Invariance and semiparametric efficiency arguments in such models thus lead 
to the new concept of sign-and-rank-based statistics, which involve both signs and 
ranks. This new concept is more natural than the traditional rank-based one in all 
models that include a location (intercept) parameter, but also in models such as 
stationary ARMA models, where the noise is inherently centered. The objective of 
the present paper is a detailed study of the class of linear sign-and-rank statistics 
for which we provide Hájek-type asymptotic representation and asymptotic nor- 
mality results. These results readily allow for building new rank-based tests for a 
variety of problems in one-, two- and k-sample location, regression, ARMA and 
related models without making any symmetry assumptions on the underlying er- 
ror densities. They also form a basis for the construction of semiparametrically 
efficient procedures in median constrained models (see [10]). 

The paper is organized as follows. Section 2 briefly introduces several concepts 
of white noise: independent, independent with zero mean, independent with zero 
median and independent symmetric white noises. We recall how the invariance 
principle for each of these concepts, but for white noise with zero mean, leads to 
a different concept of ranks and/or signs—the right concept for median-centered 
white noise being the signs and ranks. Sections 3 and 4 propose a systematic in- 
vestigation of (linear) nonserial and serial sign-and-rank statistics. These new sta- 
tistics, which are measurable with respect to the vectors of ranks and signs, are 
studied along the same lines as the classical linear rank statistics (see, e.g., [3] for 




the nonserial context; see [5] and [7] for the serial context) and the linear signed- 
rank statistics (see [3] and [13] for the nonserial context; see [7] for the serial 
context). However, the nonindependence between the ranks and the signs (in sharp 
contrast with the traditional context of signed ranks, where the signs and the ranks 
of absolute values are mutually independent) requires a more delicate treatment. 
Section 5 concludes with an empirical study: simulations very clearly show that 
the proposed procedures quite significantly outperform their classical counterparts 
based on either parametric correlograms or traditional ranks—the more skewed the 
underlying densities, the more significant the efficiency gain. 


2. White noise and group invariance. 


2.1. White noise and semiparametric statistical models. Whatever the concept
of ranks, rank-based inference applies in the context of semiparametric models
under which the distribution of some observed n-tuple $Y^{(n)} := (Y_1^{(n)}, \ldots, Y_n^{(n)})'$
belongs to a family of distributions of the form

(2.1)  $$\{P^{(n)}_{\theta;f} \mid \theta \in \Theta,\ f \in \mathcal{F}\},$$

where $\theta$ denotes some finite-dimensional parameter of interest and $f$ denotes
some unspecified density (densities throughout are tacitly taken with respect to
the Lebesgue measure over the real line) that plays the role of a nonparametric
nuisance. This distribution $P^{(n)}_{\theta;f}$, in general, is described by means of (i) a residual
function, namely, a family $\{\mathfrak{Z}^{(n)}_\theta \mid \theta \in \Theta\}$ of invertible functions indexed by $n$ and $\theta$
that map the observation $Y^{(n)}$ onto an n-tuple of residuals

$$\mathfrak{Z}^{(n)}_\theta(Y^{(n)}) = Z^{(n)}(\theta) := \bigl(Z_1^{(n)}(\theta), \ldots, Z_n^{(n)}(\theta)\bigr)',$$

and (ii) a concept of white noise with (marginal) density $f$ such that $Y^{(n)}$ has
distribution $P^{(n)}_{\theta;f}$ iff $Z^{(n)}(\theta)$ is white noise with (marginal) density $f$.
We concentrate on four particular forms of white noise. Define $\mathcal{F} := \{f : f(x) > 0,\ x \in \mathbb{R}\}$
as the set of all nonvanishing densities over the real line, let
$\mathcal{F}_* := \{f \in \mathcal{F} : \int_{-\infty}^{\infty} z f(z)\, dz = 0\}$ be the subset of all densities in $\mathcal{F}$ with zero mean,
let $\mathcal{F}_0 := \{f \in \mathcal{F} : \int_{-\infty}^{0} f(z)\, dz = \int_0^{\infty} f(z)\, dz = 1/2\}$ be the set of densities
in $\mathcal{F}$ having zero median and let $\mathcal{F}_+ := \{f \in \mathcal{F} : f(-z) = f(z),\ z \in \mathbb{R}\}$ be the
set of densities in $\mathcal{F}$ that are symmetric with respect to the origin. Denote the
following terms:

(a) Independent white noise: Let $\mathcal{H}^{(n)}[f]$ denote the hypothesis under which the random
vector $Z^{(n)} := (Z_1^{(n)}, \ldots, Z_n^{(n)})'$ is a realization of length $n$ of an independent
white noise; that is, $Z_i^{(n)}$, $i = 1, \ldots, n$, are i.i.d. with density $f \in \mathcal{F}$.

(b) Zero-mean independent white noise: Let $\mathcal{H}_*^{(n)}[f]$ denote the hypothesis under
which $Z^{(n)}$ is a realization of length $n$ of an independent white noise with zero
mean; that is, $Z_i^{(n)}$, $i = 1, \ldots, n$, are i.i.d. with density $f \in \mathcal{F}_*$.

(c) Zero-median independent white noise: Let $\mathcal{H}_0^{(n)}[f]$ denote the hypothesis under
which $Z^{(n)}$ is a realization of length $n$ of an independent white noise with zero
median; that is, $Z_i^{(n)}$, $i = 1, \ldots, n$, are i.i.d. with density $f \in \mathcal{F}_0$.

(d) Symmetric independent white noise: Let $\mathcal{H}_+^{(n)}[f]$ denote the hypothesis under
which $Z^{(n)}$ is a realization of length $n$ of an independent symmetric white
noise; that is, $Z_i^{(n)}$, $i = 1, \ldots, n$, are i.i.d. with density $f \in \mathcal{F}_+$.

The notation $\mathcal{H}^{(n)}$, $\mathcal{H}_*^{(n)}$, $\mathcal{H}_0^{(n)}$ and $\mathcal{H}_+^{(n)}$ is used whenever the underlying density
function $f$ remains unspecified within $\mathcal{F}$, $\mathcal{F}_*$, $\mathcal{F}_0$ or $\mathcal{F}_+$, respectively. In practice,
of course, the role of the random variables $Z_i^{(n)}$ is played by the residuals
$Z_i^{(n)}(\theta)$ $(i = 1, \ldots, n)$ associated with a specific value $\theta$ of the parameter in the
statistical model under consideration.

The independent white noise hypothesis $\mathcal{H}^{(n)}$ is most general, but does not
allow for identifying location parameters. A classical attitude, when location is
to be identified, consists in assuming that the underlying white noise density has
zero mean, that is, adopting $\mathcal{H}_*^{(n)}$. As already explained, an often-used alternative
solution requires the median (instead of the mean) of the white noise density to be
zero, leading to $\mathcal{H}_0^{(n)}$. The additional assumption of symmetry yields $\mathcal{H}_+^{(n)}$.


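2.2. Group invariance: ranks, signed ranks, and signs and ranks. Let
$\mathcal{E}^{(n)} := (\mathbb{R}^n, \mathcal{B}^n, \mathcal{P}^{(n)} := \{P^{(n)}_{\theta;f} \mid \theta \in \Theta,\ f \in \mathcal{F}\})$ be characterized (in the sense of
Section 2.1) by the residual function $\mathfrak{Z}^{(n)}_\theta$ and the white noise concept $\mathcal{H}^{(n)}[f]$. Denote
by $\mathcal{G}$ the set of all continuous, strictly monotone increasing functions $g : \mathbb{R} \to \mathbb{R}$
such that $\lim_{x\to\pm\infty} g(x) = \pm\infty$, define $\mathfrak{g}^{(n)} : z = (z_1, \ldots, z_n)' \in \mathbb{R}^n \mapsto \mathfrak{g}^{(n)}(z) :=
(g(z_1), \ldots, g(z_n))' \in \mathbb{R}^n$ and consider the group (acting on $\mathbb{R}^n$)

$$\mathcal{G}^{(n)}_\theta, \circ := \bigl\{(\mathfrak{Z}^{(n)}_\theta)^{-1} \circ \mathfrak{g}^{(n)} \circ \mathfrak{Z}^{(n)}_\theta,\ g \in \mathcal{G}\bigr\}, \circ.$$

This group (known as the group of order-preserving transformations of
residuals) clearly is a generating group for the fixed-$\theta$ submodel $\mathcal{E}^{(n)}(\theta) :=
(\mathbb{R}^n, \mathcal{B}^n, \mathcal{P}^{(n)}(\theta) := \{P^{(n)}_{\theta;f} \mid f \in \mathcal{F}\})$ of $\mathcal{E}^{(n)}$, with maximal invariant the vector
$R^{(n)}(\theta) := (R_1^{(n)}(\theta), \ldots, R_n^{(n)}(\theta))'$, where $R_i^{(n)}(\theta)$ denotes the rank of the residual
$Z_i^{(n)}(\theta)$ among $Z_1^{(n)}(\theta), \ldots, Z_n^{(n)}(\theta)$.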

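Similarly, let $\mathcal{G}_+ := \{g \in \mathcal{G} : g(-z) = -g(z)\}$ and denote by $\mathcal{G}^{(n)}_{\theta;+}$ the corresponding
subgroup of $\mathcal{G}^{(n)}_\theta$. This group (the group of symmetric order-preserving
transformations of residuals) is a generating group for $\mathcal{E}_+^{(n)}(\theta) := (\mathbb{R}^n, \mathcal{B}^n,
\mathcal{P}_+^{(n)}(\theta) := \{P^{(n)}_{\theta;f} \mid f \in \mathcal{F}_+\})$, the submodel of $\mathcal{E}^{(n)}(\theta)$ that results from restricting
to symmetric densities $f \in \mathcal{F}_+$. A maximal invariant here is the vector
$R_+^{(n)}(\theta) := (s_1^{(n)}(\theta) R_{+,1}^{(n)}(\theta), \ldots, s_n^{(n)}(\theta) R_{+,n}^{(n)}(\theta))'$, where $R_{+,i}^{(n)}(\theta)$ denotes the rank
of the absolute value $|Z_i^{(n)}(\theta)|$ among $|Z_1^{(n)}(\theta)|, \ldots, |Z_n^{(n)}(\theta)|$ and where $s_i^{(n)}(\theta)$
is the sign of $Z_i^{(n)}(\theta)$.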

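Turning to the model $\mathcal{E}_0^{(n)} := (\mathbb{R}^n, \mathcal{B}^n, \mathcal{P}_0^{(n)} := \{P^{(n)}_{\theta;f} \mid \theta \in \Theta,\ f \in \mathcal{F}_0\})$ characterized
by the residual function $\mathfrak{Z}^{(n)}_\theta$ and the zero-median white noise concept
$\mathcal{H}_0^{(n)}[f]$, it is easy to see that a generating group for (with obvious notation)
$\mathcal{E}_0^{(n)}(\theta)$ is obtained by considering the subgroup of $\mathcal{G}^{(n)}_\theta$ that corresponds to $\mathcal{G}_0 :=
\{g \in \mathcal{G} : g(0) = 0\}$, with maximal invariant the vectors $s^{(n)}(\theta) := (s_1^{(n)}(\theta), \ldots,
s_n^{(n)}(\theta))'$ of residual signs and $R^{(n)}(\theta)$ of residual ranks.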
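
The invariance just described can be checked numerically on a small example. The sketch below is only an illustration: the helper names and the particular transformation `g` are hypothetical, and ties are assumed absent, as is the case almost surely under a density. It shows that signs and ranks are both unchanged by a continuous increasing map with $g(0) = 0$, whereas a shift, which is order preserving but moves the origin, keeps the ranks and changes the signs.

```python
import numpy as np

def signs_and_ranks(z):
    """Signs and ordinary ranks (1..n) of a residual vector without ties."""
    z = np.asarray(z, dtype=float)
    ranks = np.argsort(np.argsort(z)) + 1
    return np.sign(z).astype(int), ranks

def g(x):
    """A continuous, strictly increasing map with g(0) = 0 that is not odd."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x**3 + x, 0.1 * x)

z = np.array([-2.0, -0.5, 0.3, 1.7, -0.1])
signs, ranks = signs_and_ranks(z)
signs_g, ranks_g = signs_and_ranks(g(z))            # same signs, same ranks
signs_shift, ranks_shift = signs_and_ranks(z + 1.0)  # same ranks, other signs
```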


Provided that the parameter $\theta$ contains a location or intercept component, and
leaving aside the condition that residuals should have finite first-order moments,
the model $\mathcal{E}_*^{(n)} := (\mathbb{R}^n, \mathcal{B}^n, \mathcal{P}_*^{(n)} := \{P^{(n)}_{\theta;f} \mid \theta \in \Theta,\ f \in \mathcal{F}_*\})$, which is characterized
by the same residual function $\mathfrak{Z}^{(n)}_\theta$ as $\mathcal{E}_0^{(n)}$, but has zero-mean rather than
zero-median white noise, coincides with $\mathcal{E}_0^{(n)}$. Both models indeed involve the
same family of distributions $\mathcal{P}^{(n)}$ over $(\mathbb{R}^n, \mathcal{B}^n)$; they only differ in the way
the nonparametric family $\mathcal{P}^{(n)}$ is split into a collection of parametric subfamilies
$\mathcal{P}_f^{(n)} := \{P^{(n)}_{\theta;f} \mid \theta \in \Theta\}$ (hence, of course, in the way $\theta$ is to be interpreted). Rather
than two distinct models, $\mathcal{E}_*^{(n)}$ and $\mathcal{E}_0^{(n)}$ thus constitute two different parametrizations
of the same model, but the invariance structure underlying $\mathcal{E}_0^{(n)}$ is not present
in $\mathcal{E}_*^{(n)}$. The median, in this respect, allows for a richer structure and, therefore,
seems more appropriate than the mean as a location parameter.


2.3. Group invariance and semiparametric efficiency. The importance of considering
maximal invariants (thus, signs and ranks in models with zero-median
white noise) has been substantiated by Hallin and Werker [11]. Their paper
showed that, in a very broad class of models, semiparametrically efficient inference
procedures can be obtained by conditioning with respect to a maximal invariant
σ-algebra.

More precisely, assume that the semiparametric family (2.1) is such that:

(i) For any fixed $f$, the parametric subfamily $\mathcal{P}_f^{(n)} := \{P^{(n)}_{\theta;f} \mid \theta \in \Theta\}$ is locally
asymptotically normal (LAN), with central sequence $\Delta_f^{(n)}(\theta)$.

(ii) For any fixed $\theta$, the nonparametric subfamily $\mathcal{P}_\theta^{(n)} := \{P^{(n)}_{\theta;f} \mid f \in \mathcal{F}\}$ is generated
by a group of transformations with maximal invariant $W^{(n)}(\theta)$.

Then, under very general conditions, semiparametrically efficient inference (testing,
estimation, etc.) at $f$ can be based on the semiparametrically efficient central
sequence $E[\Delta_f^{(n)}(\theta) \mid W^{(n)}(\theta)]$, which, moreover, is distribution-free under $\mathcal{P}_\theta^{(n)}$.
Projecting onto maximal invariant σ-algebras (generated, in the context of Section 2.2,
by the ranks, the signed ranks or the signs and ranks) thus yields




(at given f) the same results as tangent space projections. In a companion pa- 
per [10], we specialize the Hallin and Werker [11] abstract results to obtain semi- 
parametrically efficient inference in median regression and autoregressive models 
using the asymptotic representation results of the present paper for general sign- 
and-rank statistics. 

Inference based on ranks and signed ranks has long since made its way into
everyday practice and even into elementary textbooks. A fairly complete toolkit
of rank-based methods is available for the analysis of linear models with
independent observations (see [4, 18] for a systematic account and the state of
the art in this context), as well as for the analysis of linear time series
models (see [2, 5-7, 9]). It is somewhat surprising, therefore, that
sign-and-rank statistics have so far never been considered in the vast
literature devoted to that subject. The purpose of this paper is to fill this
gap.


2.4. Two simple examples. Two examples are treated in some detail in
Sections 3.4 (median regression) and 4.4 (median moving average), respectively.
Under the median-regression model, observations are of the form

(2.2) $Y_i = \theta_1 + \theta_2 c_i^{(n)} + \varepsilon_i, \qquad i = 1, \dots, n,$

where $\theta := (\theta_1, \theta_2)' \in \mathbb{R}^2$, the $c_i^{(n)}$'s are
regression constants and the $\varepsilon_i$'s are independent and identically
distributed (i.i.d.) with density $f$. Instead of the usual specification that
$\mathrm{E}[\varepsilon_i] = 0$, however, we rather impose that the median of
$\varepsilon_i$ is zero (i.e., $f \in \mathcal{F}_0$). Here, the residuals take
the form $Z_i^{(n)}(\theta) := Y_i - \theta_1 - \theta_2 c_i^{(n)}$. Under
$P^{(n)}_{\theta;f}$, these residuals are i.i.d. with density
$f \in \mathcal{F}_0$. Under fairly general conditions, this model, for fixed
$f$ (with weak derivative $f'$), is LAN with central sequence

(2.3) $\Delta_f^{(n)}(\theta) := n^{-1/2} \sum_{i=1}^{n} \frac{-f'}{f}\bigl(Z_i^{(n)}(\theta)\bigr) \binom{1}{c_i^{(n)}}.$
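As a purely illustrative sketch of the residuals and of a central sequence of the form $n^{-1/2}\sum_i (-f'/f)(Z_i)(1, c_i)'$, the following Python fragment evaluates both for a standard logistic density, for which $-f'(z)/f(z) = \tanh(z/2)$. The choice of the logistic density and the function names are assumptions made for this example only; they do not come from the paper.

```python
import math

def residuals(y, c, theta1, theta2):
    """Residuals Z_i(theta) = Y_i - theta1 - theta2 * c_i of the median-regression model."""
    return [yi - theta1 - theta2 * ci for yi, ci in zip(y, c)]

def logistic_location_score(z):
    """-f'(z)/f(z) for the standard logistic density (illustrative choice of f)."""
    return math.tanh(z / 2.0)

def central_sequence(y, c, theta1, theta2, score=logistic_location_score):
    """Return the pair n^{-1/2} * sum_i score(Z_i) * (1, c_i)."""
    z = residuals(y, c, theta1, theta2)
    n = len(z)
    first = sum(score(zi) for zi in z) / math.sqrt(n)
    second = sum(ci * score(zi) for ci, zi in zip(c, z)) / math.sqrt(n)
    return first, second

# Hypothetical data: exact fit, so every residual (and both components) is zero.
print(central_sequence([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], 1.0, 1.0))
```

Any other density with median zero can be plugged in through the `score` argument.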


In the first-order median moving average (MA) model, observations are
generated by the MA equation

(2.4) $Y_t = \varepsilon_t + \theta \varepsilon_{t-1}, \qquad t = 1, \dots, n,$

with $\theta \in (-1, 1)$. Here again, we assume that the $\varepsilon_t$'s are
independent and identically distributed with density $f$ and median zero. For
simplicity, assume $\varepsilon_0 = 0$. The residuals are defined recursively as
$Z_t^{(n)}(\theta) := Y_t - \theta Z_{t-1}^{(n)}(\theta)$, with initial value
$Z_0^{(n)}(\theta) = 0$. Here again, for fixed $f$ (with weak derivative $f'$),
LAN holds with central sequence


(2.5) AT (9) := eic F (zv zt o». 
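The residual recursion $Z_t(\theta) = Y_t - \theta Z_{t-1}(\theta)$, $Z_0(\theta) = 0$, can be sketched as follows. The helper names are hypothetical, and the starting value $\varepsilon_0 = 0$ is the simplifying assumption read from the text; under it, the recursion evaluated at the true $\theta$ returns the innovations exactly.

```python
def ma1_observations(eps, theta):
    """Y_t = eps_t + theta * eps_{t-1}, t = 1..n, with eps_0 = 0 (assumed start)."""
    y, prev = [], 0.0
    for e in eps:
        y.append(e + theta * prev)
        prev = e
    return y

def ma1_residuals(y, theta):
    """Z_t(theta) = Y_t - theta * Z_{t-1}(theta), with initial value Z_0(theta) = 0."""
    z, prev = [], 0.0
    for yt in y:
        zt = yt - theta * prev
        z.append(zt)
        prev = zt
    return z

eps = [0.5, -1.0, 2.0, 0.25]          # hypothetical innovations
y = ma1_observations(eps, 0.5)
print(ma1_residuals(y, 0.5))          # recovers eps when theta is the true value
```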


260 M HALLIN, C VERMANDELE AND B WERKER 


2.5. Sign-and-rank statistics. A sign-and-rank statistic is an
$(s^{(n)}, R^{(n)})$-measurable statistic, where
$s^{(n)} := (s_1^{(n)}, \dots, s_n^{(n)})$ and
$R^{(n)} := (R_1^{(n)}, \dots, R_n^{(n)})$ are the vector of signs and the
vector of ranks, respectively, associated with some $n$-dimensional random
vector $Z^{(n)}$. The objective of this paper is to introduce linear nonserial
(Section 3) and linear serial (Section 4) sign-and-rank statistics, and to study
their distributions under $\mathcal{H}^{(n)}_{0;f}$.

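Denote by

$N_-^{(n)} := \sum_{i=1}^{n} I[Z_i^{(n)} < 0] = \sum_{i=1}^{n} I[s_i^{(n)} = -1]$

and by

$N_+^{(n)} := \sum_{i=1}^{n} I[Z_i^{(n)} > 0] = \sum_{i=1}^{n} I[s_i^{(n)} = 1]$

the numbers of negative and positive components in $Z^{(n)}$ (in $s^{(n)}$),
respectively. Under $\mathcal{H}_0^{(n)}$, $N_-^{(n)}$ is binomial
$\mathrm{Bin}(n, 1/2)$. Letting $N^{(n)} := (N_-^{(n)}, N_+^{(n)})$, note that
$\sigma(N^{(n)}) = \sigma(N_-^{(n)}) = \sigma(N_+^{(n)})$, because
$N_+^{(n)} = n - N_-^{(n)}$ with probability 1. Since

$s_i^{(n)} = I[Z_i^{(n)} > 0] - I[Z_i^{(n)} < 0] = I[R_i^{(n)} > n - N_+^{(n)}] - I[R_i^{(n)} \le N_-^{(n)}]$

for all $i = 1, \dots, n$, the couple $(N^{(n)}, R^{(n)})$ is maximal invariant
for $\mathcal{H}_0^{(n)}$.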
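The relation between signs, ranks and the counts of negative and positive components can be checked on a small example. The sketch below (hypothetical function names, no ties and no zeros assumed) computes the signs, the ranks and $(N_-, N_+)$ from a vector, and then recovers the signs from the ranks and the counts alone through $s_i = I[R_i > n - N_+] - I[R_i \le N_-]$.

```python
def signs_ranks_counts(z):
    """Signs, ranks (1 = smallest) and (N_minus, N_plus) of a vector without zeros or ties."""
    n = len(z)
    signs = [1 if zi > 0 else -1 for zi in z]
    order = sorted(range(n), key=lambda i: z[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    n_minus = sum(1 for s in signs if s == -1)
    return signs, ranks, n_minus, n - n_minus

def signs_from_invariant(ranks, n_minus, n_plus):
    """Recover the signs from the ranks and the counts only."""
    n = len(ranks)
    return [int(r > n - n_plus) - int(r <= n_minus) for r in ranks]

z = [0.3, -1.2, 2.5, -0.1, 0.7]       # hypothetical residuals
signs, ranks, n_minus, n_plus = signs_ranks_counts(z)
print(ranks, n_minus, n_plus)
print(signs_from_invariant(ranks, n_minus, n_plus) == signs)
```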

Defining the sets

$\mathcal{N}_-^{(n)} := \{i \in \{1, \dots, n\} : s_i^{(n)} = -1\} =: \{i_1^-, \dots, i^-_{N_-^{(n)}}\}$

and

$\mathcal{N}_+^{(n)} := \{i \in \{1, \dots, n\} : s_i^{(n)} = 1\} =: \{i_1^+, \dots, i^+_{N_+^{(n)}}\},$

the distribution of $(s^{(n)}, R^{(n)})$ under $\mathcal{H}_0^{(n)}$ is
conveniently characterized as follows: The marginal distribution of $s^{(n)}$ is
uniform over the $2^n$ elements of $\{-1, 1\}^n$ and the conditional
distribution of $R^{(n)}$ given $s^{(n)}$ is such that
$(R^{(n)}_{i_1^-}, \dots, R^{(n)}_{i^-_{N_-^{(n)}}}; R^{(n)}_{i_1^+}, \dots, R^{(n)}_{i^+_{N_+^{(n)}}})$
is (conditionally) uniformly distributed over the
$(N_-^{(n)}!)(N_+^{(n)}!)$ possible combinations of a permutation of
$\{1, \dots, N_-^{(n)}\}$ and a permutation of
$\{(n - N_+^{(n)}) + 1, \dots, n\}$.

Let us finally denote by $Z^{(n)-}_{(\cdot)}$ and $Z^{(n)+}_{(\cdot)}$ the
vectors of order statistics associated with the negative and positive elements
of $Z^{(n)}$, respectively. These two vectors (the first one of length
$N_-^{(n)}$ and the second one of length $N_+^{(n)}$) constitute a natural
(random) decomposition of the vector of order statistics $Z^{(n)}_{(\cdot)}$
associated with $Z^{(n)}$.




3. Nonserial linear sign-and-rank statistics. 


3.1. Definition and conditional asymptotic representation. A linear nonserial
sign-and-rank statistic is a statistic of the form

(3.1) $S_c^{(n)} := \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)} a^{(n)}(N^{(n)}; R_i^{(n)}),$

where $a^{(n)}(\cdot\,; \cdot)$ is a real-valued score function defined over
$\{((\nu, \eta); i) : \nu, \eta \in \{0, 1, \dots, n\},\ \eta \le n - \nu,\ i \in \{1, \dots, n\}\}$;
note that each summand in (3.1) is allowed to depend on the sign $s_i^{(n)}$ of
$Z_i^{(n)}$ but also, via $N^{(n)}$, on the other signs, but not on the other
ranks. As usual, the $c_i^{(n)}$'s ($i = 1, \dots, n$) denote nonrandom
regression constants.

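The exact mean $\mathrm{E}[S_c^{(n)}]$ and the exact variance
$\mathrm{Var}[S_c^{(n)}]$ of $S_c^{(n)}$ under $\mathcal{H}_0^{(n)}$ are easily
obtained from elementary combinatorial arguments: Letting
$\bar{c}^{(n)} := n^{-1} \sum_{i=1}^{n} c_i^{(n)}$, we obtain

$\mathrm{E}[S_c^{(n)}] = (n 2^n)^{-1}\, \bar{c}^{(n)} \sum_{j=1}^{n} \sum_{\nu=0}^{n} \binom{n}{\nu} a^{(n)}((\nu, n - \nu); j)$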
and 
1 n 
Varl $0? ] = (n) | =(n)\2 
ar[Se ] n(n Es 1)27 Lc C ) 
n n n 
«X (7) Dass ea 
yz i=l 
fe ; 
;| Sa oon — v); | l 
IE ere 
respectively. 


If asymptotic results are to be obtained, some stability of the scores
$a^{(n)}$ is required as $n$ increases. We therefore assume the existence of a
score-generating function. A function $\varphi : (0, 1) \to \mathbb{R}$ is
called a score-generating function for the score function $a^{(n)}$ if

(3.2) $\mathrm{E}\bigl[\{a^{(n)}(N^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)}))\}^2\bigr] = o(1)$

under $\mathcal{H}^{(n)}_{0;f}$ as $n \to \infty$. Here $F$ denotes the
distribution function associated with density $f$. Note that, by the rule of
iterated expectations and the fact that $N^{(n)} = (N_-^{(n)}, N_+^{(n)})$ is
measurable with respect to $Z^{(n)}_{(\cdot)}$, a sufficient condition for (3.2)
to hold is

(3.3) $\mathrm{E}\bigl[\{a^{(n)}(N^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}\bigr] = o_P(1)$

under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$.


No asymptotic results for $S_c^{(n)}$ can be obtained without some assumptions
on the asymptotic behavior of the regression constants $c_i^{(n)}$,
$i = 1, \dots, n$. We assume that the classical Noether condition holds:

(N) The constants $c_i^{(n)}$, $i = 1, \dots, n$, are not all equal and

$\lim_{n \to \infty} \frac{\max_{1 \le i \le n} (c_i^{(n)} - \bar{c}^{(n)})^2}{\sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})^2} = 0.$
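A small numerical illustration of the Noether ratio $\max_i (c_i - \bar{c})^2 / \sum_i (c_i - \bar{c})^2$ may help: for the hypothetical trend constants $c_i = i$ the ratio equals $3(n-1)/(n(n+1))$ and tends to zero, so the condition is satisfied. The function name is an assumption of this sketch.

```python
def noether_ratio(c):
    """max_i (c_i - mean)^2 / sum_i (c_i - mean)^2 for constants that are not all equal."""
    n = len(c)
    mean = sum(c) / n
    dev2 = [(ci - mean) ** 2 for ci in c]
    total = sum(dev2)
    if total == 0:
        raise ValueError("constants must not all be equal")
    return max(dev2) / total

for n in (10, 100, 1000):
    # linear trend c_i = i: the ratio decreases towards zero as n grows
    print(n, noether_ratio(list(range(1, n + 1))))
```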


We may now state a first asymptotic representation and asymptotic normality 
result. This result, however, is a conditional one in the sense that the centering in 
(3.4) and (3.5) below is a conditional centering. Since, conditionally on the signs, 
the sign-and-rank statistic (3.1) reduces to a purely rank-based statistic, this con- 
ditional representation result follows from classical results on linear rank statistics 
and merely serves as an intermediate step in the derivation of the main result (of 
an unconditional nature) in Section 3.3. Contrary to the unconditional one, which 
requires exact or approximate scores, the conditional result holds for any scores 
that satisfy (3.2). 


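LEMMA 3.1. Let $\varphi : (0, 1) \to \mathbb{R}$ be a nonconstant
square-integrable score-generating function for $a^{(n)}$ and let the regression
constants $c_i^{(n)}$ ($i = 1, \dots, n$) satisfy the Noether condition (N).
Assume moreover that $\sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})^2 = O(n)$, as
$n \to \infty$. Then:

(i) (Asymptotic representation) under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(3.4) $S_c^{(n)} - \mathrm{E}[S_c^{(n)} \mid N^{(n)}] = T^{(n)}_{c;\varphi;f} - \mathrm{E}[T^{(n)}_{c;\varphi;f} \mid Z^{(n)}_{(\cdot)}] + o_P(1/\sqrt{n}),$

where $T^{(n)}_{c;\varphi;f} := \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)} \varphi(F(Z_i^{(n)}))$
($F$ stands for the distribution function associated with $f$);

(ii) (Asymptotic normality) under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(3.5) $\sqrt{n}\,\bigl(S_c^{(n)} - \mathrm{E}[S_c^{(n)} \mid N^{(n)}]\bigr) \Big/ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})^2} \ \xrightarrow{\mathcal{L}}\ \mathcal{N}(0, \sigma_\varphi^2),$

where $0 < \sigma_\varphi^2 := \int_0^1 \varphi^2(u)\,du - \bigl(\int_0^1 \varphi(u)\,du\bigr)^2 < \infty$.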




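Observe that, under $\mathcal{H}_0^{(n)}$,

$\mathrm{E}[S_c^{(n)} \mid N^{(n)}] = \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)}\, \mathrm{E}\bigl[\mathrm{E}[a^{(n)}(N^{(n)}; R_i^{(n)}) \mid s_i^{(n)}, N^{(n)}] \mid N^{(n)}\bigr]$

$\quad = \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)} \Bigl\{ P[s_i^{(n)} = -1 \mid N^{(n)}]\, \frac{1}{N_-^{(n)}} \sum_{j=1}^{N_-^{(n)}} a^{(n)}(N^{(n)}; j) + P[s_i^{(n)} = 1 \mid N^{(n)}]\, \frac{1}{N_+^{(n)}} \sum_{j=(n - N_+^{(n)})+1}^{n} a^{(n)}(N^{(n)}; j) \Bigr\}$

$\quad = \bar{c}^{(n)}\, \frac{1}{n} \sum_{j=1}^{n} a^{(n)}(N^{(n)}; j) =: \bar{c}^{(n)}\, \bar{a}^{(n)}(N^{(n)})$

and

(3.6) $\mathrm{E}[T^{(n)}_{c;\varphi;f} \mid Z^{(n)}_{(\cdot)}] = \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)}\, \mathrm{E}[\varphi(F(Z_i^{(n)})) \mid Z^{(n)}_{(\cdot)}] = \bar{c}^{(n)}\, \frac{1}{n} \sum_{i=1}^{n} \varphi(F(Z_i^{(n)})).$

Hence, part (i) of Lemma 3.1 actually states that

(3.7) $\frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, a^{(n)}(N^{(n)}; R_i^{(n)}) = \frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, \varphi(F(Z_i^{(n)})) + o_P(1/\sqrt{n}),$

under $\mathcal{H}^{(n)}_{0;f}$ as $n \to \infty$. Note that the expression on
the right-hand side of (3.7) coincides with the asymptotic representation of the
purely rank-based statistic
$\frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, a^{(n)}(R_i^{(n)})$,
where the $a^{(n)}(\cdot)$ are, for instance, the traditional exact scores
$\mathrm{E}[\varphi(F(Z_i^{(n)})) \mid R_i^{(n)}]$ associated with the
score-generating function $\varphi$. The sign-and-rank statistic $S_c^{(n)}$
thus asymptotically decomposes into two parts; one of them (namely,
$S_c^{(n)} - \mathrm{E}[S_c^{(n)} \mid N^{(n)}]$) asymptotically does not depend
on $N^{(n)}$ and represents the contribution of the ranks, while the second one
($\mathrm{E}[S_c^{(n)} \mid N^{(n)}] - \mathrm{E}[S_c^{(n)}]$) constitutes the
contribution of the signs. Moreover, the ranks and $N^{(n)}$ being mutually
independent, these two quantities are orthogonal to each other and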




contribute additively to the unconditional asymptotic variance (see the proof of 
Proposition 3.2 below). 


PROOF OF LEMMA 3.1. Since the ranks $R^{(n)}$ and $N^{(n)}$ are mutually
independent under $\mathcal{H}_0^{(n)}$, part (i) of the lemma follows from
classical asymptotic representation results for linear rank statistics; see [3],
page 61. The proof of part (ii) of the lemma, in view of (3.4), simply consists
in checking that
$\sqrt{n}\,(T^{(n)}_{c;\varphi;f} - \mathrm{E}[T^{(n)}_{c;\varphi;f} \mid Z^{(n)}_{(\cdot)}])$
satisfies the traditional Lindeberg condition. □


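3.2. Exact and approximate scores. Following the classical literature on
ranks, we consider in the present paper sign-and-rank statistics based on either
exact or approximate scores.

Let $U_1^{(n)}, \dots, U_n^{(n)}$ be an $n$-tuple of i.i.d. random variables
uniformly distributed over $(0, 1)$. Define
$s_{U;i}^{(n)} := I[U_i^{(n)} > 1/2] - I[U_i^{(n)} < 1/2]$,
$N_{U;-}^{(n)} := \sum_{i=1}^{n} I[U_i^{(n)} < 1/2]$ and
$N_{U;+}^{(n)} := \sum_{i=1}^{n} I[U_i^{(n)} > 1/2]$. Denote by $R_{U;i}^{(n)}$
the rank of $U_i^{(n)}$ among $U_1^{(n)}, \dots, U_n^{(n)}$, by
$U^{-}_{\nu;(i)}$ ($i = 1, \dots, \nu$) the $i$th-order statistic associated with
a sample of $\nu$ i.i.d. random variables uniformly distributed over $(0, 1/2)$
and by $U^{+}_{\nu;(i)}$ ($i = 1, \dots, \nu$) the $i$th-order statistic
associated with a sample of $\nu$ i.i.d. random variables uniformly distributed
over $(1/2, 1)$. Note that the conditional distribution of $U_i^{(n)}$ given the
event $s_{U;i}^{(n)} = -1$ (resp. $s_{U;i}^{(n)} = 1$) is uniform over
$(0, 1/2)$ [resp. $(1/2, 1)$]. The linear nonserial sign-and-rank statistics
constructed from the exact and approximate scores associated with $\varphi$ are
defined by

(3.8) $S^{(n)}_{c;\varphi;\mathrm{ex/appr}} := \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)}\, a^{(n)}_{\varphi;\mathrm{ex/appr}}(N^{(n)}; R_i^{(n)})$

$\qquad = \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)} \bigl\{ I[s_i^{(n)} = -1]\, a^{(n)}_{\varphi;-;\mathrm{ex/appr}}(N_-^{(n)}; R_i^{(n)}) + I[s_i^{(n)} = 1]\, a^{(n)}_{\varphi;+;\mathrm{ex/appr}}(N_+^{(n)}; R_i^{(n)} - (n - N_+^{(n)})) \bigr\},$

where the score functions $a^{(n)}_{\varphi;-;\mathrm{ex}}$,
$a^{(n)}_{\varphi;+;\mathrm{ex}}$, $a^{(n)}_{\varphi;-;\mathrm{appr}}$ and
$a^{(n)}_{\varphi;+;\mathrm{appr}}$, all defined on the set
$\{(\nu; i) : \nu, i \in \{1, \dots, n\} \text{ with } i \le \nu\}$, are given by

(3.9) $a^{(n)}_{\varphi;-;\mathrm{ex}}(\nu; i) := \mathrm{E}[\varphi(U_1^{(n)}) \mid N_{U;-}^{(n)} = \nu,\ R_{U;1}^{(n)} = i] = \mathrm{E}[\varphi(U^{-}_{\nu;(i)})],$

(3.10) $a^{(n)}_{\varphi;-;\mathrm{appr}}(\nu; i) := \varphi(\mathrm{E}[U^{-}_{\nu;(i)}]) = \varphi\Bigl(\frac{i}{2(\nu + 1)}\Bigr),$

(3.11) $a^{(n)}_{\varphi;+;\mathrm{ex}}(\nu; i) := \mathrm{E}[\varphi(U_1^{(n)}) \mid N_{U;+}^{(n)} = \nu,\ R_{U;1}^{(n)} = n - \nu + i] = \mathrm{E}[\varphi(U^{+}_{\nu;(i)})]$

and

(3.12) $a^{(n)}_{\varphi;+;\mathrm{appr}}(\nu; i) := \varphi(\mathrm{E}[U^{+}_{\nu;(i)}]) = \varphi\Bigl(\frac{1}{2} + \frac{i}{2(\nu + 1)}\Bigr).$

Observe that, under $\mathcal{H}^{(n)}_{0;f}$,
$S^{(n)}_{c;\varphi;\mathrm{ex}} = \mathrm{E}[T^{(n)}_{c;\varphi;f} \mid N^{(n)}, R^{(n)}] = \mathrm{E}[T^{(n)}_{c;\varphi;f} \mid s^{(n)}, R^{(n)}]$.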
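A minimal sketch of the approximate-score construction may be useful here. It assumes that the approximate scores are $\varphi(i/(2(\nu+1)))$ for the $i$th smallest of $\nu$ negative residuals and $\varphi(1/2 + i/(2(\nu+1)))$ for the $i$th smallest of $\nu$ positive ones, and that the statistic is the weighted average $n^{-1}\sum_i c_i\,a(N; R_i)$; function names are hypothetical, and ties and zeros are not handled.

```python
def approx_score(phi, sign, nu, i):
    """Approximate score for the i-th smallest among nu residuals of the given sign."""
    if sign < 0:
        return phi(i / (2.0 * (nu + 1)))
    return phi(0.5 + i / (2.0 * (nu + 1)))

def sign_and_rank_statistic(z, c, phi):
    """(1/n) * sum_i c_i * a_appr(N; R_i) for residuals z and regression constants c."""
    n = len(z)
    order = sorted(range(n), key=lambda j: z[j])
    ranks = [0] * n
    for r, j in enumerate(order, start=1):
        ranks[j] = r
    n_minus = sum(1 for zj in z if zj < 0)
    n_plus = n - n_minus
    total = 0.0
    for ci, zi, ri in zip(c, z, ranks):
        if zi < 0:
            total += ci * approx_score(phi, -1, n_minus, ri)
        else:
            # rank among the positive residuals: R_i - (n - N_plus)
            total += ci * approx_score(phi, +1, n_plus, ri - n_minus)
    return total / n

z = [0.3, -1.2, 2.5, -0.1, 0.7]            # hypothetical residuals
print(sign_and_rank_statistic(z, [1.0] * 5, lambda u: u))
```

Negative residuals are mapped into $(0, 1/2)$ and positive ones into $(1/2, 1)$, whatever their respective numbers.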


We then have the following proposition. 


PROPOSITION 3.1. Let $\varphi : (0, 1) \to \mathbb{R}$ be a nonconstant
square-integrable function. Then $\varphi$ is a score-generating function for
$a^{(n)}_{\varphi;\mathrm{ex}}$. If, moreover, $\varphi$ is the difference of
two nondecreasing square-integrable functions, then $\varphi$ is also a
score-generating function for $a^{(n)}_{\varphi;\mathrm{appr}}$.


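PROOF. Let us first consider the exact scores defined by relationships (3.8),
(3.9) and (3.11), and let us show that, under $\mathcal{H}^{(n)}_{0;f}$,

(3.13) $\mathrm{E}\bigl[\{a^{(n)}_{\varphi;\mathrm{ex}}(N^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}\bigr] = o_P(1)$

as $n \to \infty$. By the definition of $a^{(n)}_{\varphi;\mathrm{ex}}$, we only
need to show that

$\mathrm{E}\bigl[\{\mathrm{E}[\varphi(F(Z_1^{(n)})) \mid s_1^{(n)} = -1, N^{(n)}, R_1^{(n)}] - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr] = o_P(1),$

under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$. Since $F(Z_1^{(n)})$ is,
under $\mathcal{H}^{(n)}_{0;f}$ and conditionally on $s_1^{(n)} = -1$, uniform
over the interval $(0, 1/2)$, this readily follows from a slight generalization
of Theorem V.1.4.a in [3], page 157.

Let us now consider the approximate scores defined by (3.8), (3.10) and (3.12).
Clearly, (3.3) holds for $a^{(n)}_{\varphi;\mathrm{appr}}$ if, under
$\mathcal{H}^{(n)}_{0;f}$,

$\mathrm{E}\bigl[\{a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr]$

and

$\mathrm{E}\bigl[\{a^{(n)}_{\varphi;+;\mathrm{appr}}(N_+^{(n)}; R_1^{(n)} - (n - N_+^{(n)})) - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}, s_1^{(n)} = 1\bigr]$

are $o_P(1)$ as $n \to \infty$. We have

$\mathrm{E}\bigl[\{a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr]$

$\quad = \mathrm{E}\bigl[\{(a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; R_1^{(n)}) - a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; R_1^{(n)})) + (a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)})))\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr]$

$\quad \le 2\,\mathrm{E}\bigl[\{a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; R_1^{(n)}) - a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; R_1^{(n)})\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr]$

$\qquad + 2\,\mathrm{E}\bigl[\{a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; R_1^{(n)}) - \varphi(F(Z_1^{(n)}))\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr].$

In view of the result for exact scores, we need only consider the term
involving the difference between approximate and exact scores. Denoting by
$\lfloor x \rfloor$ the integer part of $x$ ($x \in \mathbb{R}^+$), we may write

$\mathrm{E}\bigl[\{a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; R_1^{(n)}) - a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; R_1^{(n)})\}^2 \mid N^{(n)}, s_1^{(n)} = -1\bigr]$

$\quad = \frac{1}{N_-^{(n)}} \sum_{i=1}^{N_-^{(n)}} \{a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; i) - a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; i)\}^2$

$\quad = \int_0^1 \bigl\{(a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; 1 + \lfloor N_-^{(n)} u \rfloor) - \varphi(u/2)) + (\varphi(u/2) - a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; 1 + \lfloor N_-^{(n)} u \rfloor))\bigr\}^2 du$

$\quad \le 2 \int_0^1 \bigl[a^{(n)}_{\varphi;-;\mathrm{appr}}(N_-^{(n)}; 1 + \lfloor N_-^{(n)} u \rfloor) - \varphi(u/2)\bigr]^2 du + 2 \int_0^1 \bigl[a^{(n)}_{\varphi;-;\mathrm{ex}}(N_-^{(n)}; 1 + \lfloor N_-^{(n)} u \rfloor) - \varphi(u/2)\bigr]^2 du.$

That this latter quantity is $o_P(1)$ follows from an obvious adaptation of
Lemma V.1.6.a and Theorem V.1.4.b in [3], pages 164 and 158, respectively. □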


3.3. Asymptotic representation and asymptotic normality. We now can state, 
for the nonserial case, the main result of this paper. 


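PROPOSITION 3.2. Let $\varphi : (0, 1) \to \mathbb{R}$ be a nonconstant
square-integrable score-generating function for
$a^{(n)}_{\varphi;\mathrm{ex/appr}}$ and let the regression constants
$c_i^{(n)}$ ($i = 1, \dots, n$) satisfy the Noether condition (N). Whenever
approximate scores are considered, assume that $\varphi$ is the difference of
two nondecreasing square-integrable functions. Assume, moreover, that
$\bar{c}^{(n)} = O(1)$ and $\sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})^2 = O(n)$
as $n \to \infty$. Let $\mu_\varphi^- := \int_0^{1/2} \varphi(u)\,du$,
$\mu_\varphi^+ := \int_{1/2}^{1} \varphi(u)\,du$ and
$\mu_\varphi := \int_0^1 \varphi(u)\,du$. Then, writing $S_c^{(n)}$ for either
$S^{(n)}_{c;\varphi;\mathrm{ex}}$ or $S^{(n)}_{c;\varphi;\mathrm{appr}}$:

(i) (Asymptotic representation) under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(3.14) $S_c^{(n)} - \mathrm{E}[S_c^{(n)}] = \frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, \varphi(F(Z_i^{(n)})) + \bar{c}^{(n)} \Bigl[ 2 \frac{N_-^{(n)}}{n} \mu_\varphi^- + 2 \frac{N_+^{(n)}}{n} \mu_\varphi^+ - \mu_\varphi \Bigr] + o_P(1/\sqrt{n});$

(ii) (Asymptotic normality) under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(3.15) $\sqrt{n}\,\bigl(S_c^{(n)} - \mathrm{E}[S_c^{(n)}]\bigr) \Big/ \sqrt{\frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})^2\, \sigma_\varphi^2 + [\bar{c}^{(n)} (\mu_\varphi^+ - \mu_\varphi^-)]^2} \ \xrightarrow{\mathcal{L}}\ \mathcal{N}(0, 1).$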




Note that, in case $\varphi$ is skew-symmetric with respect to 1/2 [i.e.,
$\varphi(u) = -\varphi(1 - u)$], $\mu_\varphi^- = -\mu_\varphi^+$ and
$\mu_\varphi = 0$. A simple calculation yields
$2 \frac{N_-^{(n)}}{n} \mu_\varphi^- + 2 \frac{N_+^{(n)}}{n} \mu_\varphi^+ - \mu_\varphi = 2 \mu_\varphi^+ \bigl(1 - 2 \frac{N_-^{(n)}}{n}\bigr)$.
The conditional (3.4) and unconditional (3.14) asymptotic representations thus
coincide and reduce to Hájek's traditional one for linear rank statistics, as
soon as $\bar{c}^{(n)} = o(1)$ (examples of skew-symmetric score functions are
the location scores $\varphi_f := -f'/f$ of a symmetric distribution with
density $f$).
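The sign term $2\frac{N_-}{n}\mu_\varphi^- + 2\frac{N_+}{n}\mu_\varphi^+ - \mu_\varphi$ can be evaluated numerically. The sketch below uses a midpoint rule and the hypothetical skew-symmetric score $\varphi(u) = u - 1/2$ (so $\mu_\varphi^- = -1/8$, $\mu_\varphi^+ = 1/8$, $\mu_\varphi = 0$), and compares the result with the closed form $2\mu_\varphi^+(1 - 2N_-/n)$; all names are assumptions of this example.

```python
def integrate(phi, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of phi over (a, b)."""
    h = (b - a) / steps
    return h * sum(phi(a + (j + 0.5) * h) for j in range(steps))

def sign_contribution(phi, n_minus, n):
    """2*(N_-/n)*mu_minus + 2*(N_+/n)*mu_plus - mu for a score-generating function phi."""
    mu_minus = integrate(phi, 0.0, 0.5)
    mu_plus = integrate(phi, 0.5, 1.0)
    mu = mu_minus + mu_plus
    n_plus = n - n_minus
    return 2.0 * n_minus / n * mu_minus + 2.0 * n_plus / n * mu_plus - mu

phi = lambda u: u - 0.5                      # skew-symmetric about 1/2
value = sign_contribution(phi, n_minus=3, n=10)
closed_form = 2.0 * 0.125 * (1.0 - 2.0 * 3 / 10)
print(value, closed_form)
```

When $N_- = N_+$ the term vanishes for such a score, which is consistent with it being of order $n^{-1/2}$ in probability.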


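PROOF OF PROPOSITION 3.2. (i) We first establish (3.14) for exact scores.
From (3.4) and (3.6), we have

(3.16) $S^{(n)}_{c;\varphi;\mathrm{ex}} - \mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}}] = \frac{1}{n} \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, \varphi(F(Z_i^{(n)})) + \mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}} \mid N^{(n)}] - \mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}}] + o_P(1/\sqrt{n}).$

Since

$\mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}} \mid N^{(n)}] = \mathrm{E}\bigl[\mathrm{E}[T^{(n)}_{c;\varphi;f} \mid N^{(n)}, R^{(n)}] \mid N^{(n)}\bigr] = \mathrm{E}[T^{(n)}_{c;\varphi;f} \mid N^{(n)}]$

$\quad = \mathrm{E}\bigl[\mathrm{E}[T^{(n)}_{c;\varphi;f} \mid s^{(n)}] \mid N^{(n)}\bigr] = \mathrm{E}\Bigl[\frac{1}{n} \sum_{i=1}^{n} c_i^{(n)}\, \mathrm{E}[\varphi(F(Z_i^{(n)})) \mid s_i^{(n)}] \Bigm| N^{(n)}\Bigr],$

where

$\mathrm{E}[\varphi(F(Z_i^{(n)})) \mid s_i^{(n)}] = I[s_i^{(n)} = -1]\, 2 \int_0^{1/2} \varphi(u)\,du + I[s_i^{(n)} = 1]\, 2 \int_{1/2}^{1} \varphi(u)\,du = 2 I[s_i^{(n)} = -1]\, \mu_\varphi^- + 2 I[s_i^{(n)} = 1]\, \mu_\varphi^+,$

it follows that

(3.17) $\mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}} \mid N^{(n)}] = \frac{2}{n} \sum_{i=1}^{n} c_i^{(n)}\, \mathrm{E}\bigl[I[s_i^{(n)} = -1]\, \mu_\varphi^- + I[s_i^{(n)} = 1]\, \mu_\varphi^+ \mid N^{(n)}\bigr] = 2 \bar{c}^{(n)} \Bigl( \frac{N_-^{(n)}}{n} \mu_\varphi^- + \frac{N_+^{(n)}}{n} \mu_\varphi^+ \Bigr)$

and

(3.18) $\mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}} \mid N^{(n)}] - \mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}}] = \bar{c}^{(n)} \Bigl( 2 \frac{N_-^{(n)}}{n} \mu_\varphi^- + 2 \frac{N_+^{(n)}}{n} \mu_\varphi^+ - \mu_\varphi \Bigr),$

which, along with (3.16), establish (3.14) for exact scores.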

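Turning to approximate scores, we can assume, without loss of generality, that
$\varphi$ is nondecreasing. Since (3.16) also holds if approximate scores are
substituted for the exact ones, it is sufficient, so that (3.14) holds for
approximate scores, to show that the difference

(3.19) $E^{(n)} := \bigl\{\mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{appr}} \mid N^{(n)}] - \mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{appr}}]\bigr\} - \bigl\{\mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}} \mid N^{(n)}] - \mathrm{E}[S^{(n)}_{c;\varphi;\mathrm{ex}}]\bigr\}$

is $o_P(1/\sqrt{n})$. Note that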
E[S, SN 
(n) F J 

(3.20) = CN — ap «ds )+ (Gta) 

3 aa 2 2. ANP +1) 

(n) (n) 
N N 
-80 [oA pry 428 = Dro}, 
N+ 


where D> :— =a 1 9 God) and D = 4 =1 003 + meg) re Rie 
mann sums for the integrals Hy = zs z * o(u) du and (RUE = = fi D e) du, respec- 
tively. Since q is square-integrable, any term in the Riemann sum s E pG + 
AFT associated with fij, 9!) du is o(1) as m — oo. This implies that 
i-e(4 + 3orn) is o(1/./m); henee, in view of the fact that NO = Op(n), 


this implies that a? G ES aimp op(1/./n) as n — oo. The same rea- 


soning shows that any finite sum of Riemann terms in D^ MZ or D* yo actually is 


op(1/./n) as n — œ. 
Now, any Riemann sum D+ for ud satisfies, s yp » nondecreasing, the 


m _— Dt < Dt < Di, where Dt := A DAE, e + 354) and 

= z} 2s] eG + 3c) are the upper and — Darboux sums associated 
with p y(u) du. The difference DX — D$ clearly 1s =+ (p(À + Bmp) ~ e), 
which is o(1/4/m) as m — oo. Hence, for any Riemann sum, Dj, — ug is also 
o(1/./m), so that Din — us = op(1/./n) as n — oo. 

Furthermore, since ae sequence Di — wg a to zero, it is bounded, 
so that Dw- H t is uniformly integrable and p+ jug] = 0(1/./n) 
asn—> oo. 

A similar reasoning of course holds for Din and ug . Going back to (3.20) and 


recalling that c? = O(1), we thus obtain the desired result that EU? is op(1/./n). 
This completes the proof of part (1) of the proposition. 




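(ii) As for asymptotic normality, elementary calculations yield

$\sqrt{n}\, \bar{c}^{(n)} \Bigl( 2 \frac{N_-^{(n)}}{n} \mu_\varphi^- + 2 \frac{N_+^{(n)}}{n} \mu_\varphi^+ - \mu_\varphi \Bigr) = \bar{c}^{(n)} (2 \mu_\varphi^- - \mu_\varphi) \Bigl( \frac{N_-^{(n)}}{n} - \frac{1}{2} \Bigr) \Big/ \sqrt{1/4n},$

which, since $(N_-^{(n)}/n - 1/2) / \sqrt{1/4n}$ is asymptotically standard
normal, is also asymptotically normal with mean zero and asymptotic variance
$[\bar{c}^{(n)} (\mu_\varphi^+ - \mu_\varphi^-)]^2$. The remark (right after
Lemma 3.1) on the orthogonality between the two parts of the asymptotic
representation of $S_c^{(n)}$ completes the proof. □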

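Test statistics related to "regression coefficients" naturally involve
"regression constants" $c_i^{(n)}$ that are not all equal. Quite on the
contrary, test statistics related to location and intercepts do not involve any
constants; more precisely, they are still of the form
$S^{(n)}_{c;\varphi;\mathrm{ex/appr}}$, but with constants $c_i^{(n)}$ all equal
to 1. Proposition 3.2, as it is stated, does not apply. However, going back to
the proof, one easily checks that, letting
$S^{(n)}_{\varphi;\mathrm{ex/appr}} := \frac{1}{n} \sum_{i=1}^{n} a^{(n)}_{\varphi;\mathrm{ex/appr}}(N^{(n)}; R_i^{(n)})$,
under the same assumptions on the scores $\varphi$,

(3.21) $\sqrt{n}\,\bigl(S^{(n)}_{\varphi;\mathrm{ex/appr}} - \mathrm{E}[S^{(n)}_{\varphi;\mathrm{ex/appr}}]\bigr) = \sqrt{n}\, \Bigl( 2 \frac{N_-^{(n)}}{n} \mu_\varphi^- + 2 \frac{N_+^{(n)}}{n} \mu_\varphi^+ - \mu_\varphi \Bigr) + o_P(1) \ \xrightarrow{\mathcal{L}}\ \mathcal{N}\bigl(0, (\mu_\varphi^+ - \mu_\varphi^-)^2\bigr)$

under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$.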


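3.4. Example: median regression. The central sequence (2.3) takes the form
$\Delta_f^{(n)} = \Delta_f^{(n)}(\theta) = n^{1/2} (T^{(n)}_{f;1}, T^{(n)}_{f;2})'$
with (using the notation of Section 3)

$T^{(n)}_{f;1} := \frac{1}{n} \sum_{i=1}^{n} \varphi_f(F(Z_i^{(n)})), \qquad T^{(n)}_{f;2} := \frac{1}{n} \sum_{i=1}^{n} c_i^{(n)} \varphi_f(F(Z_i^{(n)}))$

and $\varphi_f(u) := \frac{-f'}{f}(F^{-1}(u))$, $u \in (0, 1)$. Instead of an
arbitrary score-generating function, we therefore focus on $\varphi_f$. Define

$S^{(n)}_{f;1;\mathrm{ex}} := \mathrm{E}[T^{(n)}_{f;1} \mid N^{(n)}, R^{(n)}]$

and

$S^{(n)}_{f;2;\mathrm{ex}} := \mathrm{E}[T^{(n)}_{f;2} \mid N^{(n)}, R^{(n)}].$

Straightforward calculations lead to

$2 \frac{N_-^{(n)}}{n} \mu_{\varphi_f}^- + 2 \frac{N_+^{(n)}}{n} \mu_{\varphi_f}^+ - \mu_{\varphi_f} = 2 f(0)\, \frac{N_+^{(n)} - N_-^{(n)}}{n},$

so that Proposition 3.2 and (3.21) yield

$n^{1/2} \bigl(S^{(n)}_{f;1;\mathrm{ex}}, S^{(n)}_{f;2;\mathrm{ex}}\bigr)' = \Delta_f^{(n)*} + o_P(1)$

under $P^{(n)}_{\theta;f}$ as $n \to \infty$, where

$\Delta_f^{(n)*} := n^{-1/2} \begin{pmatrix} 2 f(0) (N_+^{(n)} - N_-^{(n)}) \\ \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, \varphi_f(F(Z_i^{(n)})) + 2 f(0)\, \bar{c}^{(n)} (N_+^{(n)} - N_-^{(n)}) \end{pmatrix}$

is a version of the semiparametrically efficient central sequence for $\theta$
in the semiparametric experiment. This latter statement can easily be checked
using standard tangent space calculations. Similarly, in view of (3.5) and
Proposition 3.1, the approximate score version of the same semiparametrically
efficient central sequence is

$n^{-1/2} \begin{pmatrix} 2 f(0) (N_+^{(n)} - N_-^{(n)}) \\ \sum_{i=1}^{n} (c_i^{(n)} - \bar{c}^{(n)})\, \varphi_f(\tilde{R}_i^{(n)}) + 2 f(0)\, \bar{c}^{(n)} (N_+^{(n)} - N_-^{(n)}) \end{pmatrix}$

with, for $i = 1, \dots, n$,

(3.22) $\tilde{R}_i^{(n)} := I[R_i^{(n)} \le N_-^{(n)}]\, \frac{R_i^{(n)}}{2(N_-^{(n)} + 1)} + I[R_i^{(n)} > n - N_+^{(n)}] \Bigl( \frac{1}{2} + \frac{R_i^{(n)} - (n - N_+^{(n)})}{2(N_+^{(n)} + 1)} \Bigr).$

This central sequence, which is measurable with respect to the residual signs
and ranks, can be used to perform semiparametrically efficient inference (tests,
estimation, etc.); see, for example, Section 11.9 of [16]. For a full treatment
of sign-and-rank-based versions of semiparametrically efficient central
sequences in median restricted models, we refer to [10].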




4. Serial linear sign-and-rank statistics. 


4.1. Definition and conditional asymptotic representation. Nonserial
sign-and-rank statistics, just as their traditional rank-based counterparts, are
inefficient in the context of dependent observations: Only serial statistics can
capture the effects of serial dependence. Define a linear serial sign-and-rank
statistic of order $k$ ($k \in \{1, \dots, n - 1\}$) as a statistic of the form

$S_k^{(n)} := \frac{1}{n - k} \sum_{t=k+1}^{n} a_k^{(n)}(N^{(n)}; R_t^{(n)}, \dots, R_{t-k}^{(n)}),$

where $a_k^{(n)}(\cdot\,; \cdot, \dots, \cdot)$ is defined over the product of
the set $\{(\nu, \eta) : \nu, \eta \in \{0, 1, \dots, n\},\ \eta \le n - \nu\}$
with the set of all $(k + 1)$-tuples of distinct integers in
$\{1, \dots, n\}$. The asymptotic mean and variance of $S_k^{(n)}$ are given in
the subsequent Proposition 4.1.

Here also an asymptotic representation result is proved, establishing the
asymptotic equivalence between $S_k^{(n)}$ and a "parametric" serial statistic
$T^{(n)}_{\varphi_k;f}$. The asymptotic normality of $T^{(n)}_{\varphi_k;f}$
then entails that of $S_k^{(n)}$. A function
$\varphi_k : (0, 1)^{k+1} \to \mathbb{R}$ is a score-generating function for the
serial score function $a_k^{(n)}$ if

(4.1) $\mathrm{E}\bigl[\{a_k^{(n)}(N^{(n)}; R_{k+1}^{(n)}, \dots, R_1^{(n)}) - \varphi_k(F(Z_{k+1}^{(n)}), \dots, F(Z_1^{(n)}))\}^2\bigr] = o(1)$

under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$. Once more, (4.1)
automatically holds if, under $\mathcal{H}^{(n)}_{0;f}$,

(4.2) $\mathrm{E}\bigl[\{a_k^{(n)}(N^{(n)}; R_{k+1}^{(n)}, \dots, R_1^{(n)}) - \varphi_k(F(Z_{k+1}^{(n)}), \dots, F(Z_1^{(n)}))\}^2 \mid N^{(n)}\bigr] = o_P(1)$

as $n \to \infty$. We then have the following conditional asymptotic
representation and asymptotic normality results, which are the serial
counterpart of Lemma 3.1.

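LEMMA 4.1. Let $\varphi_k : (0, 1)^{k+1} \to \mathbb{R}$ be a score-generating
function for $a_k^{(n)}$. Then:

(i) (Asymptotic representation) under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(4.3) $S_k^{(n)} - \mathrm{E}[S_k^{(n)} \mid N^{(n)}] = T^{(n)}_{\varphi_k;f} - \mathrm{E}[T^{(n)}_{\varphi_k;f} \mid Z^{(n)}_{(\cdot)}] + o_P(1/\sqrt{n}),$

where

$T^{(n)}_{\varphi_k;f} := \frac{1}{n - k} \sum_{t=k+1}^{n} \varphi_k(F(Z_t^{(n)}), \dots, F(Z_{t-k}^{(n)}))$

and

$\mathrm{E}[T^{(n)}_{\varphi_k;f} \mid Z^{(n)}_{(\cdot)}] = [n(n - 1) \cdots (n - k)]^{-1} \sum \cdots \sum_{1 \le t_1 \ne \cdots \ne t_{k+1} \le n} \varphi_k(F(Z_{t_1}^{(n)}), \dots, F(Z_{t_{k+1}}^{(n)}));$

(ii) (Asymptotic normality) if, moreover,
$0 < \int_{(0,1)^{k+1}} |\varphi_k(u_{k+1}, \dots, u_1)|^{2+\delta}\, du_1 \cdots du_{k+1} < \infty$
for some $\delta > 0$, then, under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

$\sqrt{n - k}\,\bigl(S_k^{(n)} - \mathrm{E}[S_k^{(n)} \mid N^{(n)}]\bigr) \ \xrightarrow{\mathcal{L}}\ \mathcal{N}(0, V^2),$

where, denoting by $U_1, U_2, \dots$ an i.i.d. sequence of standard uniformly
distributed random variables,

(4.4) $V^2 := \mathrm{E}\bigl[(\varphi_k^*(U_{k+1}, \dots, U_1))^2\bigr] + 2 \sum_{j=1}^{k} \mathrm{E}\bigl[\varphi_k^*(U_{k+1}, \dots, U_1)\, \varphi_k^*(U_{k+1+j}, \dots, U_{1+j})\bigr]$

with, for $u_1, \dots, u_{k+1} \in (0, 1)$,

$\varphi_k^*(u_{k+1}, \dots, u_1) := \varphi_k(u_{k+1}, \dots, u_1) - \sum_{l=1}^{k+1} \mathrm{E}[\varphi_k(U_{k+1}, \dots, U_1) \mid U_l = u_l] + k\, \mathrm{E}[\varphi_k(U_{k+1}, \dots, U_1)].$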
[=] 


PROOF. To prove part (i) of the lemma, we only need to show that, un- 


der J6$ ^, as n — oo, EHDE” Y |Z() ] = op (1), where 
DP = Jn — (Gg? — E[S(?IN(]) — (TE p — BLT, IZE) 


Since the maximal invariant (N°, RV?) depends on Z^ only through NV?, we 
actually have E[{D,”}?|Z()] = (n — k) Var[ Sy” — (D ze Conditionally on 
Ze (and hence on N™)), M D fk is a linear serial rank statistic in the sense 
of Hallin, Ingenbleek and Puri [5]. Corollary 2 of Lemma 2, and Lemma 4 (Ap- 


pendix 3) of that paper imply that there exists a constant K (not depending on n) 
such that 


(n) \2iry (n) K ) 
EIL DO? 1717) < (2k +14 ——— 
E[ [a£ (N™; RD s RY) 
— OF ( ze , F(z')yy Zs: 




By (4.1), the last term converges to zero in probability under Hy as n — oo, 
which completes the proof of (4.3). 

The asymptotic normality of Vn — E(T® „y — EITS? plo D [part (ii) of 
Lemma 4.1], hence also that of ./n — k(S{ s .BISÍPINO), is also established 
in [5]. The special form of V? follows Bon Yoshihara's [20] central limit the- 
orem for U-statistics under absolutely regular processes, which requires the 
(2 + 6)-integrability of the score-generating function gg. C 


Note that the right-hand side in (4.3) is exactly the same as in the asymptotic 
representation of the purely rank-based serial statistic 


n ge R™ 
n—k)7! (=; ; m 
( ) ae "k n--l' ° n+l 


(n) (n) 
R' Re 
-E e -- 5 » o( jj" x), 


t—k--1 





This remark, which is analogous to the remark made in the nonserial case just be- 
fore the proof of Lemma 3.1, will play a crucial role in the proof of the asymptotic 
normality part of Proposition 4.1 (1i). 


4.2. Exact and approximate scores. As in the nonserial case, two types of 
scores—the exact and the approximate ones—are naturally associated with a given 
score-generating function. Define (referring to Section 3.2 for notation) 


se 


(n) ). pn) pln) (n) 
eu;ex/appr 7^ — n—k 22 pisi Ri Rey, R 
t= 


tk}? 


where, for (7, v) € {0, 1,...,n}’, vxn-—mqandl-ziizioZz- xiu SN, 
a% (Cm 9): is sli) 
= E[ox (Uf), m R =n, ND = 
RY Pb " 
and 
ap. appr (15 V) ii, iet) 
= pe (ELU P IN =, ND. =v, RG ) = iy], 


ELUR ING = n, NG, = v, RiP = ik+1]) 


=o ( ili < = (zs 5) +i >n- (5 m Uu) 







i 


I) iggy — (a — 2) 
I — vll = + I. 
T [ik n xz F 20 £1) 
The following lemma provides sufficient conditions for $\varphi_k$ to be a
score-generating function for $a^{(n)}_{\varphi_k;\mathrm{ex}}$ and for
$a^{(n)}_{\varphi_k;\mathrm{appr}}$.

LEMMA 4.2. Let $\varphi_k : (0, 1)^{k+1} \to \mathbb{R}$ be nonconstant and
square-integrable. Then $\varphi_k$ is a score-generating function for
$a^{(n)}_{\varphi_k;\mathrm{ex}}$. If, moreover, $\varphi_k$ is a linear
combination of a finite number of square-integrable functions that are monotone
in all their arguments, then $\varphi_k$ is also a score-generating function for
$a^{(n)}_{\varphi_k;\mathrm{appr}}$.

PROOF. The proof easily follows along the same lines as in the nonserial case
and is left to the reader. □


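4.3. Unconditional asymptotic representation. Lemma 4.1 was only an
intermediate, conditional result; the following proposition provides the
corresponding unconditional asymptotic representation and asymptotic normality.
Define

$\mu_{\varphi_k} := \int_{[0,1]^{k+1}} \varphi_k(u_0, \dots, u_k)\, du_0 \cdots du_k,$

$\mu^{(0)}_{\varphi_k} := \int_{[0,1/2]^{k+1}} \varphi_k(u_0, \dots, u_k)\, du_0 \cdots du_k,$

$\mu^{(k+1)}_{\varphi_k} := \int_{[1/2,1]^{k+1}} \varphi_k(u_0, \dots, u_k)\, du_0 \cdots du_k$

and, for $\nu = 1, 2, \dots, k$,

$\mu^{(\nu)}_{\varphi_k} := \sum_{0 \le i_1 < \cdots < i_\nu \le k} \int_{u_{i_1}, \dots, u_{i_\nu} \in [1/2, 1];\ \text{all other } u_i \in [0, 1/2]} \varphi_k(u_0, \dots, u_k)\, du_0 \cdots du_k.$

PROPOSITION 4.1. Let $\varphi_k$ be a nonconstant square-integrable
score-generating function for $a^{(n)}_{\varphi_k;\mathrm{ex/appr}}$. Whenever
approximate scores are considered, assume that $\varphi_k$ is a linear
combination of square-integrable functions that are monotone in all their
arguments. Then, writing $S_k^{(n)}$ for either
$S^{(n)}_{\varphi_k;\mathrm{ex}}$ or $S^{(n)}_{\varphi_k;\mathrm{appr}}$:

(i) (Asymptotic representation) under $\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(4.5) $S_k^{(n)} - \mathrm{E}[S_k^{(n)}] = T^{(n)}_{\varphi_k;f} - \mathrm{E}[T^{(n)}_{\varphi_k;f} \mid Z^{(n)}_{(\cdot)}]$

$\quad + 2^{k+1} [n(n - 1) \cdots (n - k)]^{-1} \Bigl\{ I[N_-^{(n)} \ge k + 1]\, N_-^{(n)} (N_-^{(n)} - 1) \cdots (N_-^{(n)} - k)\, \mu^{(0)}_{\varphi_k}$

$\qquad + \sum_{\nu=1}^{k} I[k + 1 - \nu \le N_-^{(n)} \le n - \nu]\, N_-^{(n)} (N_-^{(n)} - 1) \cdots (N_-^{(n)} - k + \nu) \times N_+^{(n)} (N_+^{(n)} - 1) \cdots (N_+^{(n)} - \nu + 1)\, \mu^{(\nu)}_{\varphi_k}$

$\qquad + I[N_+^{(n)} \ge k + 1]\, N_+^{(n)} (N_+^{(n)} - 1) \cdots (N_+^{(n)} - k)\, \mu^{(k+1)}_{\varphi_k} \Bigr\} - \mu_{\varphi_k} + o_P(1/\sqrt{n});$

(ii) (Asymptotic normality) if, moreover, $\varphi_k$ is
$(2 + \delta)$-integrable for some $\delta > 0$, then, under
$\mathcal{H}^{(n)}_{0;f}$, as $n \to \infty$,

(4.6) $\sqrt{n - k}\,\bigl(S_k^{(n)} - \mathrm{E}[S_k^{(n)}]\bigr) \Big/ \Bigl\{ V^2 + (k + 1)^2 \Bigl[ \mu_{\varphi_k} - 2 \sum_{\nu=1}^{k+1} \nu\, \mu^{(\nu)}_{\varphi_k} / (k + 1) \Bigr]^2 \Bigr\}^{1/2} \ \xrightarrow{\mathcal{L}}\ \mathcal{N}(0, 1),$

with $V^2$ given in (4.4).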


When the score g is skew-symmetric with respect to 1/2 [i.e., øk (uo, ..., Ui, 
m —Qk(uo, ...,1 = Uj,..., Ux) for all i = 1,...,], then Lo, = 
vest) H uw = = ( with 


(v) — page| u ux) duo--: du 
Lo, ( p EEE E O» -- uk) duo k» 


-1 -1 —1l 
o k+1 1) (k-*1 2). k 4-1 3). 
uQ--(Fi!) d 21) ag--(*3!) ig 


This and the fact that N® /n — 1 = Op(n-1/?) implies that the right-hand side 
of (4.5) reduces to T . , — y 4126 ] + op(1). Hence, the conditional (4.3) 


fS n) 


and unconditional Bed litio dos o EIS; s™) coincide. 


PROOF OF PROPOSITION 4.1. As in the nonserial case, we first prove the as- 
ymptotic representation result for exact scores. From the definition of exact scores, 




we — for Sf) = SO)... writing T4? for Tg = gt Ehari GRP (Z0), 
F(Z), 


ELSK” IN| 
= EET,” RNIN] = E[T/? IN] 





l 2 n 
=| > Bia hea F(Z/9,))INO, sf, .. E | 


n—k t—k4-1 
where 
Elpr(F (Z9), ..., 2 (7208 Ts EUNT NN LAT 
= 2H! Í ss 
vnd qx (uo Hk) 
(4.7) 


x I[sign(uo — 4) = sf" 


T sign(us = j) = = real dug ee duy. 


The asymptotic representation (4.5) (for exact scores) follows by combining (4.7) 
and part (1) of Lemma 4.1. Turning to approximate scores, it is sufficient for (4.5) 
to hold that 


(4.8) EU :— [E S appr NO] — ES appel} — ELS NO] — El Soren} 
be op(1/./n). Note that 


Pk, appr 
=(n(n—1)---(n—-by! 


Lesh ze Sn 


: 1 TN (n) 
e (Ilr < N|(—_ 5 — ) rti» NO 15 TUN eic B ) 


ANP +1) 2. ANP +1) 


(n) tk] 
os Ting < NS (— | 
| ANP +1) 


1 — iku NO 
+ Hie NY: + XN n >)) 


For notational simplicity, let us consider the case k = 1; the general case follows 




along the same ideas. For k = 1 we have 


ELS INC] — E[SQP IN] 








9;appr 
uos EX ^ui 
n — (n) : (n) 
at 1<i#j<n” 2(N- +1) 2(N +1) 
N® a (n) 
1 j—N 
+E X w(t) 
j=l yO, ND 02 ANP 
(n) 
n NL : (n) : 
1 i- N` j 
+ È Ya(+ a _) 
j=n41 J=! 2. ANP +1) ANP +1) 
; (n) : (n) 
1 i— N 1 j- N} 
F x "(o zr 2 EE 9] 
NO 41<i¢j<n 
4 (n) (n) y Q2 0 
- IN? > 2]N (n™ — 1,0 
a p UIN- z2]N- (NE Hpi 


+I[1 < NP «n — 1] NP NP pO 
HNP = INP (NP - 108] 


AND NO -1t | (vy 


(4.9) 


————————— D s 
n(n — 1) NON — 1) NO, Ho 
(Nm)? | 
— NO NV? — 1) (uy 
NO 
«Yes x9 
s ANP +1) 
4N®N® a ANON 


+ — +— co 
n(n — 1) { NO NY > Mo, } T OS D D wi. yo = Ko, } 


any? (NS? —D* (WN "is ptt d 
n(n — 1) NO (Nf? — 1) De yt 7 Poi 
(Ny l 


NO (NO — 1) 4qN y 


NO? ! 
i 1 i 
a | 
2. 2 2(NPP 4.1) 2 ANP +1) 


where xt := max(0, x), 








1 Z i j 
DS a 
fm" Aem 2. 2. VI 2T D'2m-4 5) 
1 4 
D, :——— 
£,m Am = 
1 





+ 


1 i j ) 
2 2(-c-1)2(m-c-0D/' 


1 1 j 
ptt. | ( ) 
im = agg 2 29 ace 2° m+) 


are Riemann sums for the integrals 


1/2 pl/2 
B =| gı (uo, uj) dug du, 
0 0 
Qo [npn 
Ko, =| J pı (uo, u1) dug dui, 
0 J1/2 
" | p1/2 
Ho =l f pı (uo, u1) duo dui, 
1/2 JO 


1 al 
a = | pı (uo, u1) duo du, 
1/2 J1/2 


ponen Here again, due to the fact that v; is square-integrable, the function 
(u, v) œ> OF (u, v) := gı (u, v)I[u = v], (u, v) € [1/2, 1]?, which vanishes except 
over the cagon or the m Square, is integrable and has integral zero. Hence, 
(1/4m?) YT. o G + aF 2 + gry) as a Riemann sum for the integral of 
qr over [1/2, 11^, is o(1). Since 


2 
i ol d 
[Xe 2m40'2^ 2m E 


m i ol d 
$m) si e^ Pa 


it follows that (1/4m?) Y | 1(4 + GED ll smarty) is o(1/./m), as m — oo. 
A similar result holds for (1/4m?) $4 91 Grp: xm as well as, of course, 




for any individual terms such as (1/4m?)gi(5 + 55 cg. 5 + erp). Thus, (4.9) 
as n — oo takes the form 
) (n) 
E[57 y NORS, aN] 


—AN@ (NS -Dt = 
n(n = 1) | NO) yO 


he oes aid. v npa ep 
n(n — 1) | NT NM Lo, ] 


_ ut- 
n(n—1) * NP, N” Hg, ] 


Any? (NY? — 1+ 


mod Owe ~ Her] + orn). 


N® NP 
Considering the difference D7, — Hy, » we have 
DEUS 1 p i j 
ptt utt a (5 MN NER. x) 
mm o Pp = aa 2,2, 2'2(m41)!2  2m41) 
-JJ Qı (uo, u1) duo dui 
[1/2, 1} 
gr Y Lalat TT D'2 * xe) 


"n Jl. T pı (uo, u1) duo dui + o(1/4/m ), 


(4.10) 


because, in view of the same argument as above, the two first sums in (4.10) are 
o(1/./m). As in the proof of Proposition 3.1, due to the fact that g; can be as- 
sumed to be nondecreasing in its two arguments, the sum that appears in this latter 
expression is composed between the two Darboux sums 


and 
jy mimi 1 j 
T aga D balata ata) 


These Darboux sums also converge to the integral ffi i 91 (uo, 41) duodu: 




and 


1 1 m—-11 m-1 l 1 
Pie - Bis malala mat m ) Aa) 
mo m= aa A 72m 2* 2m | "2'2 
the same argument still implies that this difference, hence also Dy, — Hos is 
o(1/4A/m ). The other three quantities of the same type can be treated similarly. 
Uniform integrability and the fact that N} (? are Op(n), as in the proof of Proposi- 
tion 3.1, complete the proof that (4.8) is indeed op(1/./n ). 


To conclude, we now prove the asymptotic normality result. Denote by IIx; 
the set of permutations x of {1,...,k +1}. Then 


(n) (n) 
EL Soy ex IN ý ] 


= k+l 
a n kl, (v) 
=(,",;) xx ura 





] Xt; €^ <t4) <n Vv—0 
[si = (n) .. 
A 2,1 [Sr = l, “Sew l : 
mw EIT x41 
(n) (n) l 
S 7 -L.. Doo 77 m 
hence ES IN™] is a U-statistic computed from the n-tuple Z5. E 
with kerne 
hk(z1,..., Zk+1) 
kl (v) 
2e bho 


— 1)! 
x > Ize) > 0,..+5Z0@) > 0, Zest) <0, ..., Zn €0]. 
welt; 


Routine calculation yields, under H 


BUZZ ze] 
k+! 
k+l-v (v) 


V 
v=0 v=0 


and 


k+1 2 
V 
Var(E[ (Zi... ZIP) = ua us rm eel ! 
p= 




which is strictly positive. Classical results on U statistics (see, e.g., [19]) then 
imply that, under HS}, as n — oo, 
(n - S [SE IN] - ES] 


PRIEX 
k+1 2 
- (s de D bin - Y uel ) 


The same argument as in the nonserial case can be invoked to establish the as- 
ymptotic independence of the right-hand side in the conditional asymptotic rep- 


resentation (4.3) and (n — E) ELS IN] — E[SO?..). The result follows. 
J 


4.4. Example: first-order median moving average. The central sequence (2.5)
under $P^{(n)}_{\theta;f}$ clearly [central sequences are always defined up to
$o_P(1)$ quantities] can be rewritten as

$\Delta_f^{(n)} = (n - 1)^{1/2}\, r_f^{(n)} + o_P(1),$

where, defining $\varphi_f(u) := \frac{-f'}{f}(F^{-1}(u))$ and
$\psi_f(u) := F^{-1}(u)$, $u \in (0, 1)$,

$r_f^{(n)} := \frac{1}{n - 1} \sum_{t=2}^{n} \varphi_f(F(Z_t^{(n)}))\, \psi_f(F(Z_{t-1}^{(n)})),$

a measure of serial dependence associated with $f$. With this notation, it
clearly appears that $r_f^{(n)}$ is a particular case [letting $k = 1$ and
$\varphi_1(u_0, u_1) := \varphi_f(u_0)\, \psi_f(u_1)$] of the statistic
$T^{(n)}_{\varphi_k;f}$ considered in Lemma 4.1.

Define the serial linear sign-and-rank autocorrelation statistic of order 1 (based 
on en scores) as p -— Eir IN™® , R(?]. Proposition 4.1 implies that, un- 
der p? fg: aS n — OO, 


rt =r + op(1/ n) 


with 
pins po 7 (n) 
rey =lp Erga] 


_ 2 (n) (n) ¢x7(n) 0 
» n(n — 1) TRA z2]NT7 (NT — 1)(— f ©) EE 


t I[1 € N? <n- 1]N NO? 


x (-F [| stes FO f isreoas) 




4 I[NÍ? > 2] f0 f "D dz 
+ op(1//n) 


_ i) € 2] 42 NY? - NT 
= rh Birao] H 240),  — — + op (1//n) 


af (m) zm (x) (n) 
[2 yd den 3 








NO m NO 
-2f(0)u; jy cwm -Fop(1/4/n) 
1 no gl Š " no) — NO? 
"X - PY. - np) +24 Ou —— orn) 
t=2 


Letting 


(4.11) r Phap = — za eas (Rf) 


with RY” given in (3.22), the approximate score counterpart of r p -ex 18, in View 


of Lemmas 4.1 and 4.2, 





(n) (n) 
NU —N- 
Etape = - appr — Ely agg NT] 2/01; 
P e H Eoee”, 
NO u NO 
- 1) Bee er (Ry Vs (Ry) +2 Ons —+——. 
Ixt; zt? En 
In conclusion, defining AY* := n — Ire and ATE, = VR — 1 x 
Ae , we obtain 


7 fil;ex/appr 


A*S AQ + op(1) = AC + op(1) 


under Ie as n — oo. Using standard arguments, one easily verifies that A 


Ara and à are indeed three uin of the semiparametrically efficient 
(n)x 


central sequence for 0 in the model ge". ; again, the sign-and-rank-based A ;. appr 


can be used to perform semiparametrically efficient inference (tests, estimation,
etc.) for the MA(1) coefficient $\theta$; see, for example, Section 11.9 of [16].




5. Numerical study. The finite-sample performance of the proposed test statistics has been studied in the context of the first-order moving average model of the example in Section 4.4. More precisely, we generated 1000 replications of each of the MA(1) processes characterized by equation (2.4) with parameter values $\theta = \pm 0.3, \pm 0.25, \pm 0.20, \pm 0.15, \pm 0.10, \pm 0.05$ and 0, and the following asymmetric innovation densities:


(a) $f(z) := f_{t_1}(z) I[z \le 0] + f_{\mathcal{N}(0,1)}(z) I[z > 0]$, where $f_{t_1}$ stands for the Cauchy density and $f_{\mathcal{N}(0,1)}$ for the standard normal one;

(b) $f(z) := f_{t_5}(z) I[z \le 0] + f_{\mathcal{N}(0,1)}(z) I[z > 0]$, where $f_{t_5}$ stands for the Student density with 5 degrees of freedom;

(c) $f := f_{\mathcal{SN}(-10)}$ (the skew normal density with skewness $\lambda = -10$; see [1]), duly shifted and rescaled to have zero median and unit variance;

(d) $f := f_{\mathcal{SN}(-20)}$ (the skew normal density with skewness $\lambda = -20$), duly shifted and rescaled to have zero median and unit variance;

(e) $f := 0.5 f_{\mathcal{N}(0,1)} + 0.5 f_{\mathcal{N}(5,2)}$ (a mixed-normal density), duly shifted and
rescaled to have zero median and unit variance;

(f) $f := 0.75 f_{\mathcal{N}(0,1)} + 0.25 f_{\mathcal{N}(-5,1)}$ (a mixed-normal density), duly shifted and
rescaled to have zero median and unit variance.


For each replication, randomness (namely, $\theta = 0$) has been tested against first-order moving average dependence (two-sided test), based on the asymptotically normal distribution of:


(i) the ordinary first-order autocorrelation coefficient
\[ r^{(n)} := \frac{(n-1)^{-1}\sum_{t=2}^{n} (Z_t - \bar{Z})(Z_{t-1} - \bar{Z})}{n^{-1}\sum_{t=1}^{n} (Z_t - \bar{Z})^2}; \]


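(ii) the "traditional" first-order van der Waerden rank autocorrelation coefficient
\[ r^{(n)}_{\mathrm{vdW}} := \Bigl[ (n-1)^{-1} \sum_{t=2}^{n} \Phi^{-1}\Bigl(\frac{R_t^{(n)}}{n+1}\Bigr) \Phi^{-1}\Bigl(\frac{R_{t-1}^{(n)}}{n+1}\Bigr) - [n(n-1)]^{-1} \mathop{\sum\sum}_{1 \le i \ne j \le n} \Phi^{-1}\Bigl(\frac{i}{n+1}\Bigr) \Phi^{-1}\Bigl(\frac{j}{n+1}\Bigr) \Bigr] \bigl(\sigma^{(n)}_{\mathrm{vdW}}\bigr)^{-1}, \]
where $\Phi$ stands for the standard normal distribution function and $\sigma^{(n)}_{\mathrm{vdW}}$
stands for the exact standardizing constant (see, e.g., [8]);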
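(iii) the "traditional" first-order Wilcoxon rank autocorrelation coefficient
\[ r^{(n)}_{\mathrm{W}} := \Bigl[ (n-1)^{-1} \sum_{t=2}^{n} \varphi_{\log}\Bigl(\frac{R_t^{(n)}}{n+1}\Bigr) \psi_{\log}\Bigl(\frac{R_{t-1}^{(n)}}{n+1}\Bigr) - [n(n-1)]^{-1} \mathop{\sum\sum}_{1 \le i \ne j \le n} \varphi_{\log}\Bigl(\frac{i}{n+1}\Bigr) \psi_{\log}\Bigl(\frac{j}{n+1}\Bigr) \Bigr] \bigl(\sigma^{(n)}_{\mathrm{W}}\bigr)^{-1}, \]
with $\varphi_{\log}(u) := 2u - 1$ and $\psi_{\log}(u) := \ln(u/(1-u))$, $u \in (0,1)$ ($\psi_{\log}$ is proportional to the inverse of the logistic distribution function); $\sigma^{(n)}_{\mathrm{W}}$ stands for the
exact standardizing constant (see, e.g., [8]);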


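(iv) the "traditional" first-order Laplace rank autocorrelation coefficient
\[ r^{(n)}_{\mathrm{L}} := \Bigl[ (n-1)^{-1} \sum_{t=2}^{n} \varphi_{\exp}\Bigl(\frac{R_t^{(n)}}{n+1}\Bigr) \psi_{\exp}\Bigl(\frac{R_{t-1}^{(n)}}{n+1}\Bigr) - [n(n-1)]^{-1} \mathop{\sum\sum}_{1 \le i \ne j \le n} \varphi_{\exp}\Bigl(\frac{i}{n+1}\Bigr) \psi_{\exp}\Bigl(\frac{j}{n+1}\Bigr) \Bigr] \bigl(\sigma^{(n)}_{\mathrm{L}}\bigr)^{-1}, \]
with $\varphi_{\exp}(u) := \operatorname{sign}(2u - 1)$ and
\[ \psi_{\exp}(u) := \ln(2u)\, I[u \le 0.5] - \ln(2(1-u))\, I[u > 0.5], \qquad u \in (0,1) \]
($\psi_{\exp}$ is proportional to the inverse of the double-exponential distribution
function); $\sigma^{(n)}_{\mathrm{L}}$ stands for the exact standardizing constant (see, e.g., [8]);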
(v) the first-order sign-and-rank autocorrelation coefficient $r^{(n)}_{f;\mathrm{appr}}$ defined in (4.12), with the approximate scores $\varphi_f(u) = \gamma^{-1}\varphi_{\log}(u) I[u \le 0.5] + \Phi^{-1}(u) I[u > 0.5]$ and $\psi_f(u) = \gamma\,\psi_{\log}(u) I[u \le 0.5] + \Phi^{-1}(u) I[u > 0.5]$ associated with a density $f(z) := \gamma^{-1} f_{\log}(z/\gamma) I[z \le 0] + f_{\mathcal{N}(0,1)}(z) I[z > 0]$ (with $\gamma := \sqrt{\pi/8}$) that is logistic on the negative half-line and standard normal on the positive half-line (yielding Wilcoxon scores for the negative residuals and van der Waerden scores for the positive ones);


(vi) the first-order sign-and-rank autocorrelation coefficient $r^{(n)}_{f;\mathrm{appr}}$ defined in (4.12), with the approximate scores
\[ \varphi_f(u) = \gamma^{-1}\varphi_{\exp}(u) I[u \le 0.5] + \Phi^{-1}(u) I[u > 0.5] \]
and
\[ \psi_f(u) = \gamma\,\psi_{\exp}(u) I[u \le 0.5] + \Phi^{-1}(u) I[u > 0.5] \]
associated with a density $f(z) := (2\gamma)^{-1} \exp(z/\gamma) I[z \le 0] + f_{\mathcal{N}(0,1)}(z) I[z > 0]$ (with $\gamma = \sqrt{\pi/2}$) that is double-exponential on the negative half-line, and standard normal on the positive half-line (yielding Laplace scores for the negative residuals and van der Waerden scores for the positive ones).
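The "traditional" rank autocorrelations (ii)-(iv) share one structure: a lag-one sum of products of scores, minus its exact permutation mean. The sketch below computes that centered quantity only. It is an illustration under stated assumptions, not the authors' code: the exact standardizing constant of [8] is not reproduced, ties are assumed absent, and all function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def rank_autocorrelation(z, phi, psi):
    """Centered (unstandardized) lag-1 rank autocorrelation with scores phi, psi."""
    z = np.asarray(z, dtype=float)
    n = z.size
    ranks = np.argsort(np.argsort(z)) + 1        # ranks 1..n; assumes no ties
    a = np.array([phi(r / (n + 1)) for r in ranks])
    b = np.array([psi(r / (n + 1)) for r in ranks])
    serial = np.sum(a[1:] * b[:-1]) / (n - 1)    # (n-1)^{-1} sum_t phi(.) psi(.)
    # [n(n-1)]^{-1} * sum over i != j of phi(i/(n+1)) * psi(j/(n+1));
    # the ranks are a permutation of 1..n, so sums over a and b equal grid sums.
    centering = (a.sum() * b.sum() - np.sum(a * b)) / (n * (n - 1))
    return serial - centering

# Score pairs corresponding to (ii), (iii) and (iv).
inv_normal = NormalDist().inv_cdf                 # van der Waerden
phi_log = lambda u: 2.0 * u - 1.0                 # Wilcoxon
psi_log = lambda u: np.log(u / (1.0 - u))
phi_exp = lambda u: np.sign(2.0 * u - 1.0)        # Laplace
psi_exp = lambda u: np.log(2.0 * u) if u <= 0.5 else -np.log(2.0 * (1.0 - u))
```

For instance, `rank_autocorrelation(z, inv_normal, inv_normal)` gives the centered van der Waerden version. Because only ranks enter, the value is unchanged by any strictly increasing transformation of the series.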


The results of these simulations (series length n = 250; number of replications 1000) are summarized in Figures 1–6, where the graphs of the empirical power functions associated with testing procedures (i)-(vi) are plotted against $\theta$.
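The simulation design can be sketched for the simplest of the six procedures, the ordinary autocorrelation test (i), under innovation density (a). This is a minimal sketch under assumptions: the MA(1) convention $X_t = \varepsilon_t + \theta\varepsilon_{t-1}$ is assumed (equation (2.4) may use a different sign), the test rejects when $\sqrt{n}\,|r^{(n)}|$ exceeds the 5% normal critical value 1.96, and all function names are hypothetical.

```python
import numpy as np

def sample_innovations(n, rng):
    """Draw from density (a): Cauchy on z <= 0, standard normal on z > 0."""
    negative = rng.random(n) < 0.5                # each half-line carries mass 1/2
    cauchy_part = -np.abs(rng.standard_cauchy(n))
    normal_part = np.abs(rng.standard_normal(n))
    return np.where(negative, cauchy_part, normal_part)

def simulate_ma1(theta, n, rng):
    """MA(1) series of length n (assumed convention: eps_t + theta * eps_{t-1})."""
    eps = sample_innovations(n + 1, rng)
    return eps[1:] + theta * eps[:-1]

def lag1_autocorrelation(x):
    """Ordinary first-order autocorrelation coefficient, as in (i)."""
    xc = x - x.mean()
    return (np.sum(xc[1:] * xc[:-1]) / (x.size - 1)) / (np.sum(xc**2) / x.size)

def empirical_power(theta, n=250, reps=1000, seed=0):
    """Rejection frequency of the two-sided 5% correlogram test at a given theta."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(reps):
        x = simulate_ma1(theta, n, rng)
        if abs(np.sqrt(n) * lag1_autocorrelation(x)) > 1.96:
            rejections += 1
    return rejections / reps
```

Evaluating `empirical_power` over a grid of `theta` values traces one empirical power curve; the other procedures would replace `lag1_autocorrelation` with the corresponding standardized rank or sign-and-rank statistic.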

These graphs speak for themselves and need little comment. They all clearly demonstrate the superiority, under asymmetric densities, of the sign-and-rank methods over both their classical Gaussian and traditional rank-based competitors. The more skewed the underlying density, the more significant the improvement. For instance, in Figure 1 [Cauchy/Normal density (a)] the percentage of rejection at $\theta = -0.05$, which is only 0.0240 for the traditional correlogram-based tests (a severely biased test, thus), is as high as 0.7720 for the sign-and-rank Laplace/van der Waerden tests (vi). At $\theta = -0.10$, the corresponding figures are 0.2460 for the correlogram-based tests, but 0.9770 for the Laplace/van der Waerden ones. Of course, the performance of the parametric correlogram method in this case is particularly poor, due to the absence of finite moments, but the superiority of the sign-and-rank-based methods over their "purely rank-based" competitors remains quite substantial (at $\theta = -0.05$ and $\theta = -0.10$, Wilcoxon tests only yield empirical powers 0.4360 and 0.8250). Quite understandably, this superiority of the sign-and-rank methods over their competitors fades away under moderately skewed densities (see Figure 2, where it is less pronounced than in Figure 1), but it remains extremely substantial in Figures 4–6.

FIG. 1. Empirical power, under Cauchy/standard normal innovations (a), of various parametric, rank and sign-and-rank tests for randomness against first-order MA dependence [based on the test statistics (i)-(vi)]. Horizontal axis: theta (the parameter of the MA(1) model). The series length is n = 250; 1000 replications were performed.

FIG. 2. Empirical power, under Student (5 d.f.)/standard normal innovations (b), of various parametric, rank and sign-and-rank tests for randomness against first-order MA dependence [based on the test statistics (i)-(vi)]. Horizontal axis: theta (the parameter of the MA(1) model). The series length is n = 250; 1000 replications were performed.

FIG. 3. Empirical power, under skew-normal ($\lambda = -10$) innovations (c), of various parametric, rank and sign-and-rank tests for randomness against first-order MA dependence [based on the test statistics (i)-(vi)]. Horizontal axis: theta (the parameter of the MA(1) model). The series length is n = 250; 1000 replications were performed.

FIG. 4. Empirical power, under skew-normal ($\lambda = -20$) innovations (d), of various parametric, rank and sign-and-rank tests for randomness against first-order MA dependence [based on the test statistics (i)-(vi)]. Horizontal axis: theta (the parameter of the MA(1) model). The series length is n = 250; 1000 replications were performed.

FIG. 5. Empirical power, under mixed normal innovations (e), of various parametric, rank and sign-and-rank tests for randomness against first-order MA dependence [based on the test statistics (i)-(vi)]. Horizontal axis: theta (the parameter of the MA(1) model). The series length is n = 250; 1000 replications were performed.

FIG. 6. Empirical power, under mixed normal innovations (f), of various parametric, rank and sign-and-rank tests for randomness against first-order MA dependence [based on the test statistics (i)-(vi)]. Horizontal axis: theta (the parameter of the MA(1) model). The series length is n = 250; 1000 replications were performed.


Acknowledgments. The authors would like to thank two referees and an As- 
sociate Editor for their pertinent remarks on the original manuscript and for their 
criticisms, which led to a complete rewriting of the paper. 


REFERENCES

[1] AZZALINI, A. (1985). A class of distributions which includes the normal ones. Scand. J. Statist. 12 171–178. MR0808153
[2] DUFOUR, J.-M., LEPAGE, Y. and ZEIDAN, H. (1982). Nonparametric testing for time series: A bibliography. Canad. J. Statist. 10 1–38. MR0660939
[3] HÁJEK, J. and ŠIDÁK, Z. (1967). Theory of Rank Tests. Academic Press, New York. MR0229351
[4] HÁJEK, J., ŠIDÁK, Z. and SEN, P. K. (1999). Theory of Rank Tests, 2nd ed. Academic Press, San Diego. MR1680991
[5] HALLIN, M., INGENBLEEK, J.-F. and PURI, M. L. (1985). Linear serial rank tests for randomness against ARMA alternatives. Ann. Statist. 13 1156–1181. MR0803764
[6] HALLIN, M. and PURI, M. L. (1988). Optimal rank-based procedures for time-series analysis: Testing an ARMA model against other ARMA models. Ann. Statist. 16 402–432. MR0924878
[7] HALLIN, M. and PURI, M. L. (1991). Time-series analysis via rank-order theory: Signed-rank tests for ARMA models. J. Multivariate Anal. 39 1–29. MR1128669
[8] HALLIN, M. and PURI, M. L. (1992). Rank tests for time series analysis: A survey. In New Directions in Time Series Analysis. Part I (D. Brillinger, P. Caines, J. Geweke, E. Parzen, M. Rosenblatt and M. S. Taqqu, eds.) 111–153. Springer, New York. MR1235582
[9] HALLIN, M. and PURI, M. L. (1994). Aligned rank tests for linear models with autocorrelated error terms. J. Multivariate Anal. 50 175–237. MR1293044
[10] HALLIN, M., VERMANDELE, C. and WERKER, B. J. M. (2004). Semiparametrically efficient inference based on signs and ranks for median restricted models. Technical Report 0403, Tilburg Univ.
[11] HALLIN, M. and WERKER, B. J. M. (2003). Semi-parametric efficiency, distribution-freeness and invariance. Bernoulli 9 137–165. MR1963675
[12] HOROWITZ, J. L. and SPOKOINY, V. G. (2002). An adaptive rate-optimal test of linearity for median regression models. J. Amer. Statist. Assoc. 97 822–835. MR1941412
[13] HUŠKOVÁ, M. (1970). Asymptotic distribution of simple linear rank statistics for testing symmetry. Z. Wahrsch. Verw. Gebiete 14 308–322. MR0277050
[14] JUNG, S.-H. (1996). Quasi-likelihood for median regression models. J. Amer. Statist. Assoc. 91 251–257. MR1394079
[15] KOENKER, R. (2000). Galton, Edgeworth, Frisch, and prospects for quantile regression in econometrics. J. Econometrics 95 347–374. MR1752335
[16] LE CAM, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York. MR0856411
[17] MCKEAGUE, I. W., SUBRAMANIAN, S. B. and SUN, Y. Q. (2001). Median regression and the missing information principle. J. Nonparametr. Statist. 33 709–727. MR1931164
[18] PURI, M. L. and SEN, P. K. (1985). Nonparametric Methods in General Linear Models. Wiley, New York. MR0794309
[19] SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York. MR0595165
[20] YOSHIHARA, K.-I. (1976). Limiting behaviour of U-statistics for stationary absolutely regular processes. Z. Wahrsch. Verw. Gebiete 35 231–252. MR0418179
[21] ZHAO, Q. S. (2001). Asymptotically efficient median regression in the presence of heteroskedasticity of unknown form. Econometric Theory 17 765–784. MR1862375


M. HALLIN
INSTITUT DE STATISTIQUE ET DE RECHERCHE OPÉRATIONNELLE
ECARES
AND
DÉPARTEMENT DE MATHÉMATIQUE
UNIVERSITÉ LIBRE DE BRUXELLES
CP 210
B-1050 BRUSSELS
BELGIUM
E-MAIL: mhallin@ulb.ac.be

C. VERMANDELE
INSTITUT DE STATISTIQUE ET DE RECHERCHE OPÉRATIONNELLE
AND FACULTÉ DES SCIENCES SOCIALES, POLITIQUES ET ECONOMIQUES
UNIVERSITÉ LIBRE DE BRUXELLES
CP 124
B-1050 BRUSSELS
BELGIUM
E-MAIL: vermande@ulb.ac.be

B. WERKER
ECONOMETRICS AND FINANCE GROUP
CENTER
TILBURG UNIVERSITY
PO BOX 90153
5000 LE TILBURG
THE NETHERLANDS
E-MAIL: B.J.M.Werker@uvt.nl


The Annals of Statistics
2006, Vol. 34, No. 1, 290–325
DOI: 10.1214/009053605000000796
© Institute of Mathematical Statistics, 2006


LOCAL PARTIAL-LIKELIHOOD ESTIMATION FOR 
LIFETIME DATA 


By JIANQING FAN¹, HUAZHEN LIN AND YONG ZHOU²


Chinese University of Hong Kong and Princeton University, Sichuan University 
and Chinese Academy of Science 


This paper considers a proportional hazards model, which allows one to examine the extent to which covariates interact nonlinearly with an exposure variable, for analysis of lifetime data. A local partial-likelihood technique is proposed to estimate nonlinear interactions. Asymptotic normality of the proposed estimator is established. The baseline hazard function, the bias and the variance of the local likelihood estimator are consistently estimated. In addition, a one-step local partial-likelihood estimator is presented to facilitate the computation of the proposed procedure and is demonstrated to be as efficient as the fully iterated local partial-likelihood estimator. Furthermore, a penalized local likelihood estimator is proposed to select important risk variables in the model. Numerical examples are used to illustrate the effectiveness of the proposed procedures.


1. Introduction. One of the most celebrated models for analyzing lifetime 
data is the Cox proportional hazards model, which explicitly postulates the covari- 
ate effects on the hazard risk via 


\[ \lambda(t) = \lambda_0(t) \exp\{g(Z)\}, \]

where $\lambda_0(\cdot)$ is the baseline hazard risk and $g(Z)$ reflects the covariate effect. In
parametric models it is commonly assumed that

\[ g(Z) = \beta^T Z \]

for some unknown parameters $\beta$. See, for example, [1] and [20]. The log-linear
model is a simple and mathematically convenient model that provides useful analy- 
sis for a covariate effect. However, in many biomedical studies, the covariate 
effects can be more complicated than the log-linear effect and new analytic chal- 
lenges arise in assessing nonlinear effects. Beyond the traditional linear model, 
there are infinitely many possible nonlinear forms. Depending on the background 


Received December 2002; revised November 2004. 
1 Supported in part by NIH Grant R01 HL69720, NSF Grant DMS-03-55179 and RGC Grant
CUHK 4262/01P of HKSAR. 
2 Supported in part by the Fund of National Natural Science (Grant 10471140) of China. 
AMS 2000 subject classifications. Primary 62G05, secondary 62N01, 62N02. 
Key words and phrases. Local partial likelihood, one-step estimation, varying coefficient, propor- 
tional hazards model, variable selection. 




of study, one often chooses a form that reasonably explains the objective of the 
study. For example, the effect of exposure variables and confounding factors on 
the hazard risk may vary with the level of an exposure variable, denoted by W. 
This leads one naturally to consider the model 


\[ \lambda(t) = \lambda_0(t) \exp\{\beta(W(t))^T Z(t) + g(W(t))\}. \tag{1.1} \]

Here $\beta(\cdot)$ and $g(\cdot)$ are unknown coefficient functions, characterizing the extent to
which the association varies with the level of the exposure variable W. Note that
the term $g(W(t))$ can be incorporated into the covariates $Z(t)$ by introducing a
dummy variable with column one. We opt not to do so, because the local intercept
for $g(\cdot)$ will cancel out in the local partial likelihood (2.3) below, leading to a
different estimator rule for g. For ease of presentation, we drop the dependence of
covariates on time t, with the understanding that the methods and proofs in this
paper are applicable to time-dependent covariates. 

When the variable W is time, rather than a covariate variable, model (1.1) 
becomes a time-dependent coefficient Cox model, which has been studied by 
a number of authors, including Zucker and Karr [37], Murphy and Sen [31], 
Gamerman [21], Murphy [30], Marzec and Marzec [28], Martinussen, Scheike 
and Skovgaard [27], Cai and Sun [10], and Tian, Zucker and Wei [32]. In this case, 
unless the coefficient functions $\beta(t)$ are independent of time t, the model is no
longer a proportional hazards model. In contrast, model (1.1) is still a proportional 
hazards model. It allows one to examine the extent to which covariates Z inter- 
act nonlinearly with the exposure variable W. As will be explained later, although 
model (1.1) looks similar to the time-dependent coefficient Cox model, it is more 
involved when establishing asymptotic properties. 

The varying-coefficient models arise from many different fields and have been 
studied in many different contexts. For cross-sectional type data, they have been 
studied as models to explore nonlinearity and assess nonlinear interactions by 
Cleveland, Grosse and Shyu [14], Hastie and Tibshirani [24], Carroll, Ruppert 
and Welsh [12], Fan and Zhang [19] and Cai, Fan and Li [8], among others. In 
time series, they are extensions of threshold autoregressive models and have been 
used to enhance the predictive power of linear autoregressive models. See, for ex- 
ample, [13] and [9]. The varying coefficient models have also been widely used 
to analyze longitudinal data. They allow one to examine the extent to which the 
association between independent and dependent variables varies over time. See, 
for example, [7, 25, 35, 36]. 

In this paper we propose techniques for estimating the coefficient functions $\beta(\cdot)$
using local linear techniques [15]. The asymptotic bias and variance are obtained 
by establishing asymptotic normality. The variance is then estimated via a sand- 
wich formula, which is shown to be consistent. To save computation of the local 
partial-likelihood estimator, a one-step procedure is proposed, which is shown to 
have the same asymptotic bias and variance as the local partial-likelihood estima- 
tor. Implementation of the proposed estimator depends on the choice of good ini- 
tial estimators: estimates at the nearest grid points are recommended. The resulting 




procedure is demonstrated to be quite effective in our numerical implementation. 
In addition, the baseline hazard function $\lambda_0(\cdot)$ is estimated via a kernel method.
The consistency property is demonstrated. 

An objective of survival analysis is to identify the risk factors and their risk con- 
tributions. At the initial stage of a study, many covariates are collected to reduce 
possible modeling biases, and a large model is built, namely the dimensionality 
of Z in (1.1) is high. An important and challenging task is to efficiently select a 
subset of significant variables from model (1.1). Fan and Li [17] proposed a fam- 
ily of new variable selection methods based on a nonconcave penalized likelihood. 
Their methods are different from traditional ones in that they delete insignificant 
variables by estimating their coefficients as 0, and simultaneously select significant 
variables and estimate regression coefficients. Lasso, proposed by Tibshirani [33, 
34], is a member of this family with an $L_1$ penalty. From their simulations, Fan
and Li [17] showed that the penalized likelihood estimator with smoothly clipped 
absolute deviation (SCAD) penalty outperforms the best subset variable selection 
in terms of computational cost and stability in the terminology of Breiman [5]. In 
addition, they have proven that SCAD improves the lasso in terms of estimation 
biases. Furthermore, they have demonstrated that with a proper choice of regular- 
ization parameters and penalty functions (such as SCAD), the penalized likelihood 
estimator possesses an oracle property. Namely, the true regression coefficients that 
are zero are automatically estimated as zero and the remaining coefficients are es- 
timated as well as if the correct submodel is known in advance. Hence, the SCAD 
and its siblings are ideal for variable selection, at least from a theoretical point 
of view. These nice properties encouraged us to extend the technique to the non- 
parametric model (1.1). It gives us a quick and effective method for eliminating 
unimportant variables. 
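For readers unfamiliar with the SCAD penalty mentioned above, the following sketch writes out its standard three-piece form next to the $L_1$ (lasso) penalty. The formula is not stated in this paper and is taken from the usual definition in Fan and Li [17], with their suggested default a = 3.7; the function names are hypothetical.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lam(|theta|): linear, then quadratic, then constant."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                            # |theta| <= lam
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))  # lam < |theta| <= a*lam
    constant = lam**2 * (a + 1) / 2                             # |theta| > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quadratic, constant))

def l1_penalty(theta, lam):
    """Lasso penalty, which keeps growing linearly for large coefficients."""
    return lam * np.abs(np.asarray(theta, dtype=float))
```

The penalty coincides with the lasso near zero, which is what sets small coefficients to zero, and is flat beyond `a * lam`, so large coefficients are not shrunk; this is the mechanism behind the reduced estimation bias relative to the lasso described in the text.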

The paper is organized as follows. Section 2 introduces the local partial- 
likelihood estimation and establishes the asymptotic normality. One-step esti- 
mation and estimation of the baseline hazard function are studied in Section 3. 
Section 4 deals with the issue of variable selection. Numerical examples are given 
in Section 5. Technical proofs are relegated to Appendix A. 


2. Partial-likelihood estimation. Suppose that there is a random sample of size n from an underlying population. Let $T_i$ denote the potential failure time, let $C_i$ denote the potential censoring time and let $X_i = \min(T_i, C_i)$ denote the observed time for the ith individual. Assume that $T_i$ and $C_i$ are independent given covariates $Z_i$ and $W_i$. Let $\Delta_i$ be an indicator which equals 1 if $X_i$ is a failure time and 0 otherwise. The covariates Z and W are allowed to be time dependent. The observed data structure is

\[ \{X_i, \Delta_i, Z_i, W_i\}, \qquad i = 1, \dots, n, \]

where $Z_i = (Z_{i1}, \dots, Z_{ip})^T$ and $W_i$ are two types of covariates, with W being an
exposure variable of interest.




When all the observations are independent, the partial likelihood for model (1.1) is

\[ L(\beta(\cdot), g(\cdot)) = \prod_{i=1}^{n} \left[ \frac{\exp\{\beta(W_i)^T Z_i + g(W_i)\}}{\sum_{j \in R(X_i)} \exp\{\beta(W_j)^T Z_j + g(W_j)\}} \right]^{\Delta_i}, \tag{2.1} \]

where $R(t) = \{i : X_i \ge t\}$ denotes the set of the individuals at risk just prior to
time t.
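To make the structure of (2.1) concrete, here is a minimal sketch that evaluates its logarithm for user-supplied coefficient functions. It assumes time-independent covariates and no tied failure times; `beta` and `g` are hypothetical callables, and the function name is not from the paper.

```python
import numpy as np

def log_partial_likelihood(x, events, z, w, beta, g):
    """Log of (2.1); beta maps w to a p-vector and g maps w to a scalar."""
    x = np.asarray(x, dtype=float)
    events = np.asarray(events)
    z = np.asarray(z, dtype=float)
    # Linear predictor beta(W_i)^T Z_i + g(W_i) for every subject.
    eta = np.array([beta(wi) @ zi + g(wi) for wi, zi in zip(w, z)])
    total = 0.0
    for i in np.flatnonzero(events == 1):        # only failures contribute
        at_risk = x >= x[i]                      # risk set R(X_i)
        total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return total
```

Adding a constant to `g` leaves the value unchanged, since it cancels between numerator and denominator; this is the same cancellation that removes the local intercept in the local partial likelihood.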


2.1. Local partial likelihood. If the unknown functions $\beta(\cdot)$ and $g(\cdot)$ are parametrized, the parameters can be estimated by maximizing (2.1). For our nonparametric estimation, since the forms of the unknown functions are not available, we can only rely on their qualitative traits.

Assume that every component of $\beta(\cdot)$ and $g(\cdot)$ is smooth so that it admits Taylor
expansion: for each given $w_0$ and w around $w_0$,

\[ \begin{aligned} \beta(w) &\approx \beta(w_0) + \beta'(w_0)(w - w_0) \equiv \delta + \eta(w - w_0), \\ g(w) &\approx g(w_0) + g'(w_0)(w - w_0) \equiv \alpha + \gamma(w - w_0). \end{aligned} \tag{2.2} \]


Substituting this into (2.1), we obtain the logarithm of the local partial likelihood,

\[ \begin{aligned} \ell(\gamma, \delta, \eta) = n^{-1} \sum_{i=1}^{n} K_h(W_i - w_0) \Delta_i \Bigl[ & \delta^T Z_i + \eta^T Z_i (W_i - w_0) + \gamma (W_i - w_0) \\ & - \log\Bigl( \sum_{j \in R(X_i)} \exp\{\delta^T Z_j + \eta^T Z_j (W_j - w_0) + \gamma (W_j - w_0)\} K_h(W_j - w_0) \Bigr) \Bigr], \end{aligned} \tag{2.3} \]

where K is a probability density called a kernel function, h represents the size of
the local neighborhood and $K_h(\cdot) = K(\cdot/h)/h$. The kernel weight is introduced to
ensure that the local model (2.2) is only applied to the data around $w_0$. The local
partial likelihood (2.3) can be derived from a profile likelihood point of view. The
derivation is similar to those of Breslow [6] and Fan, Gijbels and King [16].
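The local criterion (2.3) can be evaluated directly. The sketch below is an illustration under assumptions, not the authors' implementation: it uses the Epanechnikov kernel (the text only requires K to be a density), time-independent covariates, and a parameter vector stacked as $(\delta, \eta, \gamma)$; all names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, a density supported on [-1, 1]."""
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def local_log_partial_likelihood(xi, w0, h, x, events, z, w):
    """Evaluate (2.3) at xi = (delta, eta, gamma) for the point w0 and bandwidth h."""
    x = np.asarray(x, dtype=float)
    events = np.asarray(events)
    z = np.asarray(z, dtype=float)               # shape (n, p)
    w = np.asarray(w, dtype=float)
    n, p = z.shape
    d, e, gam = xi[:p], xi[p:2 * p], xi[2 * p]
    dw = w - w0
    kern = epanechnikov(dw / h) / h              # K_h(W_i - w0)
    lin = z @ d + (z @ e) * dw + gam * dw        # local linear predictor
    total = 0.0
    for i in range(n):
        if events[i] == 1 and kern[i] > 0:
            at_risk = x >= x[i]                  # risk set R(X_i)
            denom = np.sum(np.exp(lin[at_risk]) * kern[at_risk])
            total += kern[i] * (lin[i] - np.log(denom))
    return total / n
```

The kernel weight appears both outside the bracket and inside the risk-set sum, so subjects far from `w0` drop out of both. No intercept for g enters because it would cancel in the ratio.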

Let $\hat\gamma(w_0)$, $\hat\delta(w_0)$ and $\hat\eta(w_0)$ be the maximizer of (2.3). Then $\hat\beta(w_0) = \hat\delta(w_0)$ is a local linear estimator for the coefficient function $\beta(\cdot)$ at the point $w_0$. Similarly, an estimator of $g'(\cdot)$ at the point $w_0$ is simply the local slope $\hat\gamma(w_0)$, namely $\hat g'(w_0) = \hat\gamma(w_0)$. The curve g can be estimated by integration on the function $\hat g'(w_0)$. Following Hastie and Tibshirani [23], the integration can be approximated by using the trapezoidal rule.
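The integration step can be sketched as a cumulative trapezoidal sum over the grid on which the slope has been estimated. Since g is recovered only up to an additive constant, the sketch fixes it by setting g to zero at a chosen grid index; that normalization, like the function name, is an assumption made here for illustration.

```python
import numpy as np

def integrate_slope(grid, slope, anchor_index=0):
    """Recover g on the grid from estimates of g' via the trapezoidal rule."""
    grid = np.asarray(grid, dtype=float)
    slope = np.asarray(slope, dtype=float)
    # Area of each trapezoid between consecutive grid points.
    steps = 0.5 * (slope[1:] + slope[:-1]) * np.diff(grid)
    g = np.concatenate(([0.0], np.cumsum(steps)))
    return g - g[anchor_index]                   # normalize g(anchor) = 0
```

With `grid` equally spaced on [0, 1] and `slope = 2 * grid`, the result reproduces `grid**2` exactly, because the trapezoidal rule is exact for linear integrands.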




We now express the local partial likelihood using the counting process notation.
To this end, let $N_i(t) = I(T_i \le t, \Delta_i = 1)$ and $Y_i(t) = I(X_i \ge t)$. Set

\[ \xi = (\delta^T, \eta^T, \gamma)^T \quad\text{and}\quad X_i^* = (Z_i^T, Z_i^T (W_i - w_0), (W_i - w_0))^T. \]

Then the local partial-likelihood function (2.3) can be expressed as

\[ \begin{aligned} \ell_n(\xi, \tau) = {} & n^{-1} \sum_{i=1}^{n} \int_0^\tau K_h(W_i - w_0)\, \xi^T X_i^* \, dN_i(u) \\ & - n^{-1} \sum_{i=1}^{n} \int_0^\tau K_h(W_i - w_0) \log\Bigl[ \sum_{j=1}^{n} Y_j(u) \exp(\xi^T X_j^*) K_h(W_j - w_0) \Bigr] dN_i(u) \end{aligned} \tag{2.4} \]

with $\tau = \infty$. To avoid the technicality of tail problems, only the data up to a finite
time point $\tau$ are frequently used. Without ambiguity, we will let $\hat\xi(w_0)$ be the
maximizer of (2.4).

Note that the local partial likelihood in (2.4) is more complicated than that for 
the time-dependent coefficient Cox model. In particular, the kernel functions ap- 
pear twice in the local partial likelihood (2.4), so as to use only local data. In con- 
trast, for the time-dependent coefficient model, localizing in time once suffices. As 
a consequence, the technical proofs are more involved in the current setting. 

The above method uses only one smoothing parameter to fit all the coefficient 
functions. When the coefficient functions admit different degrees of smoothness
[e.g., $g'(w)$ often admits a different degree of smoothness from other coefficient
functions], one needs to use different bandwidths for different components. The 
two-step estimation method of Fan and Zhang [19] can be adapted here. 


2.2. Asymptotic normality. We now establish the asymptotic normality of the local partial-likelihood estimator. As shown in Appendix A, the local partial-likelihood function $\ell_n(\xi, \tau)$ is concave in $\xi$ and its maximizer exists with probability tending to 1. Let H be a $(2p + 1) \times (2p + 1)$ diagonal matrix, with the
first p elements 1 and the remaining p + 1 elements h, where p is the number
of elements in Z. For any function E(w), w € J, let ||5$]|; = sup, c; |£(w)], for 
a p-vector a, let |a| — OF acy and ||a|| = sup; |a, |, and for a matrix A, let 
|Al| = sup;, |a,;|. Then we have the following consistency result. 


THEOREM 1. Under Conditions A.1–A.8 in Appendix A, we have

\[ H(\hat\xi(w_0) - \xi_0(w_0)) \xrightarrow{P} 0, \]

where $\xi_0(w_0) = (\beta_0(w_0)^T, \beta_0'(w_0)^T, g_0'(w_0))^T$ is the vector of the true parameter functions. If, in addition, Conditions B.1–B.8 hold, then we have the uniform consistency

\[ \|H(\hat\xi - \xi_0)\|_{J_W} = \sup_{w \in J_W} |H(\hat\xi(w) - \xi_0(w))| \xrightarrow{P} 0, \]

where $J_W$ is a compact subset of the support of the random variable W.


To express explicitly the bias and variance of the estimator, we introduce some
necessary notation. Let

\[ \mu_i = \int x^i K(x)\, dx, \qquad \nu_i = \int x^i K^2(x)\, dx. \]

Denote

\[ P(u, z, w_0) = P(X \ge u \mid Z = z, W = w_0) \]

and

\[ \rho(u, z, w_0) = P(u, z, w_0) \exp\{\beta_0(w_0)^T z + g_0(w_0)\}. \]

For k = 0, 1, 2, define

\[ a_k(u, w_0) = f(w_0) E\{\rho(u, Z, w_0) Z^{\otimes k} \mid W = w_0\}, \]

where $f(\cdot)$ is the density of W and $Z^{\otimes k} = 1$, $Z$ and $ZZ^T$ for k = 0, 1 and 2, respectively. Additionally set

\[ a_k = a_k(w_0) = \int_0^\tau a_k(u, w_0)\, d\Lambda_0(u). \]

We will drop the dependence of $a_k(u, w_0)$ and $a_k(w_0)$ on $w_0$ whenever there is no
ambiguity. Finally, let


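\[ \Gamma = \Gamma(w_0) = a_2 - \int_0^\tau a_1(u) a_1(u)^T a_0^{-1}(u) \lambda_0(u)\, du \]

and

\[ Q = \begin{pmatrix} (a_2 - a_1 a_1^T a_0^{-1})^{-1} & -(a_2 - a_1 a_1^T a_0^{-1})^{-1} a_1 a_0^{-1} \\ -a_0^{-1} a_1^T (a_2 - a_1 a_1^T a_0^{-1})^{-1} & (a_0 - a_1^T a_2^{-1} a_1)^{-1} \end{pmatrix}, \]

where, in fact, $a_0$ is a scalar.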


THEOREM 2 (Asymptotic normality). Suppose that Conditions A.1–A.8 in
Appendix A hold. Then

\[ \sqrt{nh}\,\bigl\{ H(\hat\xi(w_0) - \xi_0(w_0)) - \tfrac{1}{2} h^2 e_p \xi_0''(w_0) \mu_2 \bigr\} \xrightarrow{\mathcal{L}} N(0, \Sigma(\tau, w_0)), \]

where $e_p$ is a (2p + 1)-order diagonal matrix, with the first p elements 1 and the
last p + 1 diagonal elements 0, $\xi_0''(w_0) = (\beta_0''(w_0)^T, \beta_0'''(w_0)^T, g_0'''(w_0))^T$ and




The above theorem gives the joint asymptotic normality for the local partial- 
likelihood estimator. Its marginal distribution can easily be obtained as in the fol- 
lowing corollary. 


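COROLLARY 1. Under the conditions of Theorem 2, we have

\[ \sqrt{nh}\,\bigl( \hat\beta(w_0) - \beta_0(w_0) - h^2 \beta_0''(w_0) \mu_2 / 2 \bigr) \xrightarrow{\mathcal{L}} N(0, \nu_0 \Gamma^{-1}), \]

\[ \sqrt{nh^3}\,\bigl( \hat g'(w_0) - g_0'(w_0) \bigr) \xrightarrow{\mathcal{L}} N\bigl(0, (a_0 - a_1^T a_2^{-1} a_1)^{-1} \nu_2 \mu_2^{-2}\bigr). \]

Furthermore, they are asymptotically independent.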


As a consequence of Theorem 2, the theoretical optimal bandwidth can be ob- 
tained. 
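As a sketch of how such a bandwidth arises, assume only what the bias term in Theorem 2 indicates: a leading bias of order $h^2$ and a variance of order $(nh)^{-1}$ for $\hat\beta(w_0)$. The shorthand $b = \beta_0''(w_0)\mu_2/2$ and $V$ (the asymptotic covariance matrix of $\sqrt{nh}\,\hat\beta(w_0)$) is introduced here for illustration only and is not the paper's notation. Balancing squared bias against variance in the asymptotic mean squared error gives

\[
\mathrm{AMSE}(h) = h^4 |b|^2 + \frac{\operatorname{tr}(V)}{nh},
\qquad
\frac{d}{dh}\mathrm{AMSE}(h) = 4 h^3 |b|^2 - \frac{\operatorname{tr}(V)}{n h^2} = 0
\;\Longrightarrow\;
h_{\mathrm{opt}} = \left( \frac{\operatorname{tr}(V)}{4 |b|^2 n} \right)^{1/5}.
\]

Under these assumptions $h_{\mathrm{opt}}$ is of order $n^{-1/5}$ and the resulting mean squared error is of order $n^{-4/5}$, the usual rate for local linear smoothing of a twice-differentiable function.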


3. Issues related to partial-likelihood estimation. In this section we discuss a few issues that are related to the implementation of the partial-likelihood estimator.


3.1. One-step local partial-likelihood estimator. When estimating the whole functions $\beta(\cdot)$ and $g(\cdot)$, we usually need to apply the local partial likelihood (2.4) at hundreds of points. Computing such an implicit estimator requires an iterative algorithm such as the Newton–Raphson method or Fisher's scoring method. Even worse, for certain given $w_0$, there does not exist a local partial-likelihood estimator due to the limited amount of data around $w_0$. These drawbacks make the local partial-likelihood estimator less appealing. Following Fan and Chen [15], we propose a one-step estimator as a viable alternative.

The local partial-likelihood estimator $\hat\xi$ is found via solving the likelihood equation $\ell_n'(\xi, \tau) = 0$, where $\ell_n'(\xi, \tau) = \partial \ell_n(\xi, \tau) / \partial \xi$. To facilitate notation, from now on we drop the dependence of $\ell_n(\xi, \tau)$ on $\tau$. For a given initial estimator $\hat\xi_0$, by Taylor expansion we have

\[ \ell_n'(\hat\xi_0) + \ell_n''(\hat\xi_0)(\hat\xi - \hat\xi_0) \approx 0. \]

Thus, the one-step estimator $\hat\xi_{OS}$ is defined as

\[ \hat\xi_{OS} = \hat\xi_0 - \{\ell_n''(\hat\xi_0)\}^{-1} \ell_n'(\hat\xi_0). \tag{3.1} \]

A natural question arises: How good an initial estimator $\hat\xi_0$ is needed for the one-step estimator to have the same performance as the maximum local partial-likelihood estimator? The following theorem gives an answer to this question.


THEOREM 3. Under the conditions given in Theorem 2, $\hat\xi_{OS}$ has the same asymptotic distribution as the maximum local partial-likelihood estimator $\hat\xi$, provided that

\[ H(\hat\xi_0 - \xi_0) = O_P(h^2 + (nh)^{-1/2}). \tag{3.2} \]




Theorem 3 provides the conditions under which the one-step estimator performs as well as the local partial-likelihood estimator. However, it does not provide any guidance for choosing an initial estimator. Cai, Fan and Li [8] provided a useful strategy for the choice of initial estimators and their idea can be adapted to the current setting. The basic idea is first to compute the local partial-likelihood estimates at a few fixed points. Use these estimates as the initial values of their nearest grid points and obtain the one-step estimates at these grid points. For example, in our simulation studies we evaluate the functions at $n_{\mathrm{grid}} = 200$ grid points. We first compute the maximum local pseudo-partial-likelihood estimators at specific grid points $w_{20}$, $w_{60}$, $w_{100}$, $w_{140}$ and $w_{180}$, and then use them as the initial values for the one-step estimator at their nearest grid points. Use the newly computed one-step estimates (at points $w_{19}$, $w_{21}$, $w_{59}$, $w_{61}$, ...) as the initial values of their nearest grid points to compute the one-step estimates and so on, until the one-step estimates at all grid points are computed. Hence, as long as the number of grid points is large enough, condition (3.2) holds.
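The propagation scheme just described can be sketched generically. The code below is an illustration under assumptions: `fit_full`, `score_at` and `hessian_at` are hypothetical callables standing for the fully iterated fit and for the first and second derivatives of the local criterion at a grid point, and the toy concave objective at the end is a stand-in, not the local partial likelihood itself.

```python
import numpy as np

def one_step(xi0, score, hessian):
    """Single Newton-Raphson update (3.1) from the initial value xi0."""
    return xi0 - np.linalg.solve(hessian(xi0), score(xi0))

def propagate_one_step(grid, anchors, fit_full, score_at, hessian_at):
    """Fully iterate at the anchor indices, then spread one-step updates outward."""
    est = {k: fit_full(grid[k]) for k in anchors}
    frontier = list(anchors)
    while frontier:
        new_frontier = []
        for k in frontier:
            for nb in (k - 1, k + 1):            # nearest grid neighbours
                if 0 <= nb < len(grid) and nb not in est:
                    w = grid[nb]
                    est[nb] = one_step(est[k],
                                       lambda xi: score_at(w, xi),
                                       lambda xi: hessian_at(w, xi))
                    new_frontier.append(nb)
        frontier = new_frontier
    return np.array([est[k] for k in range(len(grid))])

# Toy concave objective (hypothetical): l_w(xi) = -cosh(xi - sin(w)),
# maximized at xi = sin(w).
def toy_score(w, xi):
    return -np.sinh(xi - np.sin(w))

def toy_hessian(w, xi):
    return np.diag(-np.cosh(xi - np.sin(w)))

def toy_fit_full(w):
    return np.array([np.sin(w)])

grid = np.linspace(0.0, np.pi, 200)
anchors = [19, 59, 99, 139, 179]                 # grid points 20, 60, ..., 180
estimates = propagate_one_step(grid, anchors, toy_fit_full, toy_score, toy_hessian)
max_error = np.max(np.abs(estimates[:, 0] - np.sin(grid)))
```

Each one-step update starts from a neighbour's estimate, so the initial error is roughly the change of the true function over one grid spacing; a finer grid therefore yields better initial values, which is the point of the last sentence of the paragraph above.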


3.2. Estimation of baseline hazard function. With estimators of $\beta(\cdot)$ and $g(\cdot)$,
we can estimate the baseline hazard function by using a kernel smoothing,

\[ \hat\lambda_0(t) = \int W_b(t - x)\, d\hat\Lambda_0(x), \]

where W is a given kernel function, b is a given bandwidth and

\[ \hat\Lambda_0(t) = \frac{1}{n} \sum_{i=1}^{n} \int_0^t \frac{dN_i(u)}{n^{-1} \sum_{j=1}^{n} Y_j(u) \exp(\hat\beta(W_j)^T Z_j(u) + \hat g(W_j))}. \]

Note that $\hat\Lambda_0(\cdot)$ is an estimate of the cumulative hazard function $\Lambda_0$.
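A minimal sketch of this two-stage estimator follows: a Breslow-type step function for the cumulative baseline hazard, then kernel smoothing of its jumps. It assumes time-independent covariates, no tied failure times, the Epanechnikov kernel for W, and a precomputed vector `risk_score` holding $\exp(\hat\beta(W_j)^T Z_j + \hat g(W_j))$; all names are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, a density supported on [-1, 1]."""
    return 0.75 * np.maximum(1.0 - u**2, 0.0)

def cumulative_hazard_jumps(x, events, risk_score):
    """Jump of the cumulative baseline hazard estimate at each observed time."""
    x = np.asarray(x, dtype=float)
    events = np.asarray(events)
    risk_score = np.asarray(risk_score, dtype=float)
    jumps = np.zeros_like(x)
    for i in np.flatnonzero(events == 1):        # jumps occur only at failures
        jumps[i] = 1.0 / np.sum(risk_score[x >= x[i]])
    return jumps

def cumulative_baseline_hazard(t, x, events, risk_score):
    """Step-function estimate of the cumulative baseline hazard at time t."""
    jumps = cumulative_hazard_jumps(x, events, risk_score)
    return np.sum(jumps[np.asarray(x, dtype=float) <= t])

def smoothed_baseline_hazard(t, x, events, risk_score, b):
    """Kernel-smoothed baseline hazard: sum of W_b(t - X_i) times the jump at X_i."""
    jumps = cumulative_hazard_jumps(x, events, risk_score)
    u = (t - np.asarray(x, dtype=float)) / b
    return np.sum(epanechnikov(u) / b * jumps)
```

With all risk scores equal to one, the cumulative estimate reduces to the Nelson-Aalen estimator, which is a convenient sanity check.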


THEOREM 4. Under Condition B in Appendix A, we have

\[ \hat\Lambda_0(t) \to \Lambda_0(t) \quad\text{and}\quad \hat\lambda_0(t) \to \lambda_0(t) \]

uniformly on $(0, \tau]$ in probability.


3.3. Estimation of biases and variances. The biases of nonparametric esti- 
mates are generally hard to estimate, since they involve higher-order derivatives. 
However, their variances can be estimated quite reasonably. Thus, in construction 
of confidence intervals/bands, the bias components are frequently omitted; in par- 
ticular, undersmoothing procedures have been used to make the biases negligible 
relative to their standard error. See, for example, [4, 22, 26]. Some people might ar- 
gue that this is also the approach that parametric methods take—modeling biases 
are inevitable and they are simply ignored in the construction of the parametric 
confidence intervals.




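The bias and covariance of these local estimators $H(\hat\xi(w_0) - \xi_0(w_0))$ can be
estimated by

$\hat A_n^{-1}(\tau, w_0)\hat B_n(\tau, w_0)$ and $(nh)^{-1}\hat A_n^{-1}(\tau, w_0)\hat\Pi_n(\tau, w_0)\hat A_n^{-1}(\tau, w_0)$,

where

$\hat A_n(\tau, w_0) = \frac{1}{n}\sum_{i=1}^n\int_0^\tau K_h(W_i - w_0)\,\frac{S_{n2}(u, w_0)S_{n0}(u, w_0) - S_{n1}(u, w_0)S_{n1}(u, w_0)^T}{(S_{n0}(u, w_0))^2}\,dN_i(u)$,

$\hat B_n(\tau, w_0) = \frac{1}{n}\sum_{i=1}^n\int_0^\tau K_h(W_i - w_0)\Big(U_i^*(u) - \frac{S_{n1}(u, w_0)}{S_{n0}(u, w_0)}\Big)Y_i(u)\hat\lambda_i(u)\,du$,

$\hat\Pi_n(\tau, w_0) = \frac{h}{n}\sum_{i=1}^n\int_0^\tau K_h^2(W_i - w_0)\Big(U_i^*(u) - \frac{S_{n1}(u, w_0)}{S_{n0}(u, w_0)}\Big)^{\otimes 2}Y_i(u)\hat\lambda_i(u)\,du$,

with $\hat\lambda_i(u) = \exp(\hat\beta^T(W_i)Z_i(u) + \hat g(W_i))\hat\lambda_0(u)$, $U_i^* = H^{-1}X_i^*$ and

$S_{nk}(u, w_0) = \sum_{i=1}^n K_h(W_i - w_0)Y_i(u)\exp(\hat\xi(w_0)^T X_i^*(u))(U_i^*(u))^{\otimes k}$, $k = 0, 1, 2$.
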
THEOREM 5. Under the conditions of Theorem 4, we have

$h^{-2}\hat A_n^{-1}(\tau, w_0)\hat B_n(\tau, w_0) \to e_p\xi''(w_0)\mu_2/2$,
$\hat A_n^{-1}(\tau, w_0)\hat\Pi_n(\tau, w_0)\hat A_n^{-1}(\tau, w_0) \to \Sigma(\tau, w_0)$

in probability.


In fact, by using the martingale properties, we can construct different estimators
of $B_n(\tau, w_0)$ and $\Pi(\tau, w_0)$ without estimating the baseline hazard function $\lambda_0(\cdot)$.
That is,


x Su, wo) 
=— = ee aN 
B, (1, wo) 20 M K (W, w)(0; (u) — $ cQ. id ji (u), 
íí,(, "jer Df K2(W, - wy (Uta) - pour dN,(u). 
Snol, wo) 


The results of Theorem 5 still hold when the quantities $\hat B_n(\tau, w_0)$ and $\hat\Pi_n(\tau, w_0)$
are replaced by $\tilde B_n(\tau, w_0)$ and $\tilde\Pi_n(\tau, w_0)$, respectively.




One can also use the bootstrap method as in [32] to obtain an estimated variance 
for our estimators. In fact, the method is particularly useful for estimating the 
sampling variability of $\hat g(w)$, since its analytic form is hard to derive.


4. Variable selection via nonconcave penalized likelihood. 


4.1. Local penalized likelihood. For the nonparametric model (1.1), it is not 
easy to give a variable selection procedure without going to detailed inferences on 
each coefficient function. Motivated by the work of Fan and Li [17, 18], we apply 
their procedure locally around each grid point $w_0$. This results in the penalized log
partial-likelihood function

(4.1) $Q(\xi) = \ell_n(\xi, \tau) - \sum_{j=1}^{2p+1} p_\varrho(|\xi_j|)$,

where $p_\varrho(\cdot)$ is a penalty function. The penalized local partial-likelihood estimate
of $\xi$ is to maximize (4.1). With a proper choice of $\varrho$ and a penalty function, many
estimated coefficients will be zero and hence their corresponding variables do not 
appear in the model at the point wo. This achieves the objective of variable selec- 
tion and results in a simple and implementable method to begin with. 

A good penalty function should result in an estimator with the following three 
properties: unbiasedness for large coefficients to attenuate biases, sparsity (many 
small coefficients are estimated as zero) to reduce model complexity and conti- 
nuity to avoid unnecessary variation in model prediction. Necessary conditions for 
unbiasedness, sparsity and continuity have been derived by Antoniadis and Fan [3] 
and Fan and Li [17]. A simple penalty function that satisfies all the three mathe- 
matical requirements is the smoothly clipped absolute deviation (SCAD) penalty, 
defined by 

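(4.2) $p'_\varrho(\theta) = \varrho\Big\{I(\theta \le \varrho) + \frac{(a\varrho - \theta)_+}{(a - 1)\varrho}I(\theta > \varrho)\Big\}$
for some $a > 2$ and $\theta > 0$.

Fan and Li [17] suggested using $a = 3.7$ from a Bayesian point of view and this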
value will be used in our numerical implementation. 
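For concreteness, the SCAD derivative in (4.2) and the penalty obtained by integrating it from zero can be coded as below. This is a sketch for illustration only; the function names and the closed-form pieces of the penalty are written out here under the usual normalization $p_\varrho(0) = 0$, which is an assumption not spelled out in the text above.

```python
import numpy as np

def scad_derivative(theta, rho, a=3.7):
    """p'_rho(theta) for theta >= 0, following (4.2)."""
    theta = np.asarray(theta, dtype=float)
    tail = np.maximum(a * rho - theta, 0.0) / ((a - 1.0) * rho)
    return rho * np.where(theta <= rho, 1.0, tail)

def scad_penalty(theta, rho, a=3.7):
    """p_rho(|theta|), the integral of the derivative with p_rho(0) = 0."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = rho * t                                    # |theta| <= rho
    quadratic = (2 * a * rho * t - t ** 2 - rho ** 2) / (2 * (a - 1))
    constant = (a + 1) * rho ** 2 / 2                   # |theta| > a * rho
    return np.where(t <= rho, linear,
                    np.where(t <= a * rho, quadratic, constant))
```

The derivative equals $\varrho$ near the origin (which produces sparsity) and vanishes beyond $a\varrho$ (which leaves large coefficients unpenalized at the margin), matching the properties listed above.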

There are two issues related to the practical implementation of the procedure. 
First, to facilitate the implementation we use only one regularization parameter for 
all variables which can have very different scales. Thus, we need to standardize 
variables before using (4.1). Since each variable in (4.1) is used locally around a 
given point wo, its sample mean and standard deviation should be defined locally. 
For example, the variable $Z_1$ at the point $w_0$ can be standardized by using

$\mathrm{ave}(Z_1|w_0) = \frac{1}{N}\sum_{i=1}^n K_h(W_i - w_0)Z_{i1}$

and

$\mathrm{var}(Z_1|w_0) = \frac{1}{N}\sum_{i=1}^n K_h(W_i - w_0)\{Z_{i1} - \mathrm{ave}(Z_1|w_0)\}^2$,

where $N = \sum_{i=1}^n K_h(W_i - w_0)$. The second issue is that the number of variables
as a function of wo, if not constant, will be discontinuous. This will lead to dis- 
continuous estimates of coefficient functions. This may not be bad in terms of 
overall prediction error, but does not produce parsimonious and appealing models. 
To avoid this, we use a simple voting rule: if a coefficient function is estimated 
as zero over a certain percentage of grid points, delete its corresponding variable; 
otherwise keep the variable. In our implementation, we use the majority voting 
rule, namely, the thresholding percentage is taken as 50%. 
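Both implementation devices are simple to express in code. The sketch below assumes a Gaussian kernel and treats "estimated as zero over a certain percentage of grid points" as "the fraction of zero estimates exceeds the threshold"; the handling of an exact tie at the threshold is an assumption, and the function names are illustrative.

```python
import numpy as np

def kernel_weights(w, w0, h):
    """Gaussian K_h(W_i - w0)."""
    u = (np.asarray(w, dtype=float) - w0) / h
    return np.exp(-0.5 * u ** 2) / (np.sqrt(2.0 * np.pi) * h)

def local_standardize(z, w, w0, h):
    """Standardize covariate z around w0 using kernel-weighted mean and variance."""
    z = np.asarray(z, dtype=float)
    k = kernel_weights(w, w0, h)
    n_eff = k.sum()                                  # N in the text
    ave = (k * z).sum() / n_eff
    var = (k * (z - ave) ** 2).sum() / n_eff
    return (z - ave) / np.sqrt(var), ave, var

def keep_variable(zero_flags, threshold=0.5):
    """Voting rule: drop the variable if it is zero at more than `threshold` of grid points."""
    return bool(np.mean(zero_flags) <= threshold)
```

When all $W_i$ equal $w_0$ the weights are constant and the local mean and variance reduce to the ordinary sample mean and (population-form) variance.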


4.2. Oracle property. We now establish an oracle property of the penalized 
local partial-likelihood estimator. We assume without loss of generality that the 
first s variables of Z are significant and the last p — s variables are not significant. 
To state our main result more explicitly, we need the following notation. 

Recall that $\xi = (\delta^T, \eta^T, \gamma)^T$. We divide $\delta$ into $(\delta_1^T, \delta_2^T)^T$, where $\delta_1$ and $\delta_2$ are
$s \times 1$ and $(p - s) \times 1$ vectors, representing, respectively, the nonvanishing and vanishing
coefficients. Corresponding to the partition of $\delta$, we divide $\eta$ into $(\eta_1^T, \eta_2^T)^T$.
Write

$\xi_1 = (\delta_1^T, \eta_1^T, \gamma)^T = (\xi_{1,1}, \ldots, \xi_{1,2s}, \xi_{1,2s+1})^T$

and $\xi_2 = (\delta_2^T, \eta_2^T)^T$. Let $\xi_{10} = (\xi_{1,1,0}, \ldots, \xi_{1,2s+1,0})^T$ and $\xi_{20}$ and $\xi_0$ be, respectively,
the true values of $\xi_1$, $\xi_2$ and $\xi$. For example, $\xi_{1,j,0} = \beta_{j0}(w_0)$ for
$j = 1, \ldots, s$, $\xi_{1,j,0} = \beta'_{j-s,0}(w_0)$ for $j = s + 1, \ldots, 2s$ and $\xi_{1,2s+1,0} = g'_0(w_0)$. Without
loss of generality, assume that $\xi_{20} = 0$. Set

$a_n(w_0) = \max\{p'_\varrho(|\xi_{1,j,0}|) : \xi_{1,j,0} \ne 0\}$,

$b_n(w_0) = \max\{p''_\varrho(|\xi_{1,j,0}|) : \xi_{1,j,0} \ne 0\}$.

Let $\Pi_1$ and $A_1$ be, respectively, the submatrices of $\Pi(\tau, w_0)$ and $A(\tau, w_0)$ in
(A.10) and (A.16) in Appendix A that correspond to the rows in $\xi_1$. Corresponding
to the partition of $\delta$, let $\Gamma^{-1} = (\Gamma_{-1}^T, \Gamma_{-2}^T)^T$ with $\Gamma_{-1}$ and $\Gamma_{-2}$ being $s \times p$ and
$(p - s) \times p$ matrices, respectively.
The following theorem shows how the rates of convergence for the penalized 
local partial-likelihood estimates depend on the regularization parameter. 


THEOREM 6. Suppose that Conditions A.1–A.8 in Appendix A hold. If
$b_n(w_0) \to 0$, then there exists a local maximizer $\hat\xi_p$ of $Q(\xi)$ such that $\|\hat\xi_p - \xi_0\| =
O_p(h^2 + (nh)^{-1/2} + a_n(w_0))$.


It is clear from Theorem 6 that by choosing a proper $\varrho$, such that $a_n(w_0) =
O((nh)^{-1/2} + h^2)$, there exists a $\{(nh)^{-1/2} + h^2\}$-consistent penalized local partial-likelihood
estimator. Now we show that this estimator must possess an oracle
property.


THEOREM 7. Assume that the penalty function $p_\varrho(\theta)$ satisfies

(4.3) $\liminf_{n\to\infty}\liminf_{\theta\to 0+} p'_\varrho(\theta)/\varrho > 0$.

Let $\varrho \to 0$, $\{(nh)^{-1/2} + h^2\}/\varrho \to 0$ and $a_n(w_0) = O((nh)^{-1/2} + h^2)$. Under the
conditions of Theorem 6, the consistent local maximizer $\hat\xi_p = (\hat\xi_{1p}^T, \hat\xi_{2p}^T)^T$ in Theorem 6
satisfies the following statements with probability tending to 1:

(a) (Sparsity) We have $\hat\xi_{2p} = 0$.
(b) (Asymptotic normality) We have

VnhBy [mdi — § 10) 
(4.4) 


anny in (™f6O9)]] n 


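where $b = (p'_\varrho(|\xi_{1,1,0}|)\,\mathrm{sgn}(\xi_{1,1,0}), \ldots, p'_\varrho(|\xi_{1,2s+1,0}|)\,\mathrm{sgn}(\xi_{1,2s+1,0}))^T$,
$B_1 = A_1 + H_1^{-1}\Sigma_\varrho H_1^{-1}$, $\Sigma_\varrho = \mathrm{diag}(p''_\varrho(|\xi_{1,1,0}|), \ldots, p''_\varrho(|\xi_{1,2s+1,0}|))$,
$\beta_0(w_0) = (\beta_{10}(w_0), \beta_{20}(w_0), \ldots, \beta_{s0}(w_0), 0)^T$, and $H_1$ is a $(2s + 1) \times (2s + 1)$ diagonal
matrix with first $s$ elements 1 and the last $s + 1$ elements $h$.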


We now explain that the penalized local-likelihood estimators possess an oracle 
property when penalty functions are properly chosen. Suppose that there is an 
oracle who knows $\xi_{20} = 0$. She then uses this knowledge to estimate $\xi_1$, resulting
in an oracle estimator. From Theorem 2, the asymptotic covariance matrix of this
oracle estimator is $A_1^{-1}\Pi_1 A_1^{-1}$. For penalty functions such as SCAD,
since $\varrho \to 0$, for sufficiently large $n$,

$a_n(w_0) = 0$ and $b_n(w_0) = 0$, so $b = 0$ and $\Sigma_\varrho = 0$.

Thus, Theorems 6 and 7 yield that $\hat\xi_{2p} = 0$ and $H_1(\hat\xi_{1p} - \xi_{10})$ is asymptotically
normal with covariance matrix $A_1^{-1}\Pi_1 A_1^{-1}$, which is the same as the
asymptotic variance of the oracle estimator (see Theorem 2). Furthermore, it can 
easily be seen that both estimators share the same asymptotic bias. Thus, the penal- 
ized likelihood estimators perform as well as the oracle estimator when the penalty 
functions are constant at the tails. In other words, when the true parameters have 
some zero components, they are estimated as 0 with probability tending to 1 and 
the nonzero components are estimated as well as the case where the correct sub- 
model is known. 


5. Numerical examples. 


5.1. Simulations. In this section we first compare the performance of the one-step
and local partial-likelihood estimators. The performance of estimator $\hat\beta(\cdot)$ is
assessed via the weighted mean square error (WMSE),

(5.1) $\mathrm{WMSE} = \frac{1}{n_{\mathrm{grid}}}\sum_{j=1}^{p}\sum_{k=1}^{n_{\mathrm{grid}}} a_j[\hat\beta_j(w_k) - \beta_j(w_k)]^2$,

or the unweighted mean square error (UMSE) with all $a_j = 1$, where $\{w_k, k =
1, \ldots, n_{\mathrm{grid}}\}$ are the grid points at which the functions $\beta(\cdot)$ are estimated. In the
following examples, the Gaussian kernel will be used, $n_{\mathrm{grid}} = 200$ and, for WMSE,
$a_j$ is reciprocal to the sample variance of $\{\beta_j(w_k)\}$.
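The criterion (5.1) is straightforward to compute. In the sketch below the coefficient functions are stored as arrays of shape (p, n_grid); taking the weights from the sample variance (with divisor n_grid - 1) of the true function values over the grid is an assumption about how $a_j$ is formed, and a constant true function would need its weight supplied explicitly.

```python
import numpy as np

def wmse(beta_hat, beta_true, weights=None):
    """Weighted mean square error (5.1) over the grid."""
    beta_hat = np.asarray(beta_hat, dtype=float)      # shape (p, n_grid)
    beta_true = np.asarray(beta_true, dtype=float)    # shape (p, n_grid)
    if weights is None:
        # a_j = 1 / sample variance of beta_j over the grid (assumed convention).
        weights = 1.0 / beta_true.var(axis=1, ddof=1)
    weights = np.asarray(weights, dtype=float)
    sq_err = (beta_hat - beta_true) ** 2
    return float((weights[:, None] * sq_err).sum() / beta_true.shape[1])

def umse(beta_hat, beta_true):
    """Unweighted version: all a_j = 1."""
    p = np.asarray(beta_true).shape[0]
    return wmse(beta_hat, beta_true, weights=np.ones(p))
```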


EXAMPLE 1. We first consider the varying-coefficient model A(t) = 4t? x 
$\exp\{b(Z_1(t), Z_2, W)\}$ with


$b(Z_1, Z_2, W) = 0.5W(1.5 - W)Z_1 + \sin(2W)Z_2$
$\qquad + 0.5\{\exp(W - 1.5) - \exp(-1.5)\}$,


where $W$ is a random variable uniformly distributed on $[0, 3]$, the covariate $Z_1(t)$
is time-dependent, defined as $Z_1(t) = (Z_1/4)I(t \le 1) + Z_1 I(t > 1)$, and $Z_1$ and $Z_2$
are jointly normal with correlation 0.5, each with mean 0 and standard deviation 5.
The censoring random variable $C$ given $(Z_1, Z_2, W)$ is distributed uniformly on
$[0, a(Z_1, Z_2, W)]$, where

$a(Z_1, Z_2, W) = c_1 I(b(Z_1, Z_2, W) > b_0) + c_2 I(b(Z_1, Z_2, W) \le b_0)$,

with $b_0$ being the mean function of $b(Z_1, Z_2, W)$. The constants $c_1 = 0.8$ and
$c_2 = 20$ are chosen so that about 30-40% of data are censored in each region of
the function $a(\cdot)$.

We have conducted 200 simulations with sample size 300. Figure 1(a) depicts 
the distribution for the WMSE over the 200 replications, using the three band- 
widths $h = 0.2, 0.5, 1$. The initial value is chosen at grid points $w_{20}$, $w_{60}$, $w_{100}$,
$w_{140}$ and $w_{180}$ by the local partial-likelihood estimator just mentioned in Section 3.1.
It is evident that the performances of the one-step local partial-likelihood
estimator (one-step LPLE) and local partial-likelihood estimator (LPLE) are com- 
parable for a wide range of bandwidths. Figure 1(b)-(d) presents estimates of the 
coefficient functions from a typical sample (attaining the median WMSE perfor- 
mance) with h = 0.2. 

We now test the accuracy of our standard error formula given in Section 3.3. The 
standard deviations, denoted by SD in Table 1, of 200 estimated $\hat\beta_1(w_0)$, $\hat\beta_2(w_0)$
and $\hat g'(w_0)$, based on 200 simulations, can be regarded as the true standard errors.
The average and the standard deviation of 200 estimated standard errors, denoted
by $\mathrm{SE}_{\mathrm{ave}}$ and $\mathrm{SE}_{\mathrm{std}}$, summarize the overall performance of the standard error formula.
Table 1 presents the results at the points $w = 0.3, 0.75, 1.5, 2.25$ and $2.7$,
which correspond to the 10th, 25th, 50th, 75th and 90th percentiles of the distribution
of $W$. The performance of the standard error formula is quite satisfactory.
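The three summaries reported in Table 1 can be computed from simulation output as sketched below. The inputs are hypothetical arrays holding, for one point $w_0$, the estimate and its estimated standard error from each replication; using the divisor (number of replications minus 1) for the standard deviations is an assumption.

```python
import numpy as np

def summarize_standard_errors(estimates, estimated_ses):
    """Return SD of the estimates and the mean/SD of the estimated standard errors."""
    estimates = np.asarray(estimates, dtype=float)
    estimated_ses = np.asarray(estimated_ses, dtype=float)
    return {
        "SD": float(np.std(estimates, ddof=1)),          # empirical "true" standard error
        "SE_ave": float(np.mean(estimated_ses)),         # average of formula-based SEs
        "SE_std": float(np.std(estimated_ses, ddof=1)),  # variability of formula-based SEs
    }

summary = summarize_standard_errors([1.0, 2.0, 3.0], [0.9, 1.1])
```

A standard error formula is doing well when SE_ave is close to SD and SE_std is small relative to both.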


[Figure 1, panels: (a) Performance comparisons (boxplots for LPLE and OS); (b) Estimated curve for $\beta_1$; (c) Estimated curve for $\beta_2$; (d) Estimated curve for $g$; horizontal axis $W$ in panels (b)-(d).]

FIG. 1. Simulation results for Example 1. (a) Boxplots for the distribution of the WMSE over
the 200 replications, using the three bandwidths $h = 0.2, 0.5, 1$ (from left to right). (b), (c) and
(d) Typical estimates of $\beta_1(\cdot)$, $\beta_2(\cdot)$ and $g(\cdot)$ with bandwidth $h = 0.2$ (solid line, true function;
dashed line, one-step LPLE, i.e., OS).

EXAMPLE 2. In the following examples, we evaluate the performance of the 
proposed variable selection method. Samples of size 300 were simulated from the 


TABLE 1
True and estimated standard errors using bandwidth $h = 0.2$ for Example 1

         $\hat\beta_1(w_0)$                     $\hat\beta_2(w_0)$                     $\hat g'(w_0)$
$w_0$    SE_ave  (SE_std)    SD       SE_ave  (SE_std)    SD       SE_ave  (SE_std)
0.30     0.0573  (0.0098)    0.0655   0.0479  (0.0111)    0.3831   0.3735  (0.0492)
0.75     0.0479  (0.0076)    0.0579   0.0337  (0.0079)    0.2779   0.2967  (0.0354)
1.50     0.0414  (0.0058)    0.0473   0.0236  (0.0043)    0.1910   0.2457  (0.0258)
2.25     0.0343  (0.0046)    0.0232   0.0197  (0.0018)    0.1873   0.1602  (0.0228)
2.70     0.0385  (0.0053)    0.0321   0.0222  (0.0027)    0.2491   0.1474  (0.0178)


[Figure 2: Performance comparisons (boxplots for Oracle, SCAD and Full).]

FIG. 2. Boxplot for the distribution of the UMSE over the 200 replications, using bandwidth
$h = 0.3$ and regularization parameter 0.3.


hazard regression model 


$\lambda(t) = \exp\Big\{\sum_{j=1}^{4} Z_j\beta_j(w) + g(w)\Big\}$,


where fi(w) = 3(w — 2), pa(w) = 4cos(—£) and fs(w) = f(w) = 
g(w) = 0. The covariates Z1, Z2, Z3 and Z4 are jointly normal, all with mean 0 
and variance 2, and pairwise correlation 0.6. They are independent of W, which is 
uniformly distributed on [0, 3]. The censoring time follows the uniform distribu- 
tion on [0, 7] so that about 30—40% of the data were censored. The kernel function 
is Gaussian. 

The performance of the proposed variable selection technique is compared 
with that of the maximum local partial-likelihood estimator from the full model 
and from the oracle estimator, which is based on the model with only covariates 
$Z_1$ and $Z_2$. Figure 2 depicts the distribution for the UMSE over the 200 replications,
using bandwidth $h = 0.3$ and regularization parameter 0.3. It is evident that the proposed
variable selection procedure outperforms the maximum local partial-likelihood es- 
timator and performs comparably with the oracle estimator. 

Using the majority voting (50%) rule, the variables $Z_3$, $Z_4$ and $g(W)$ were
simultaneously deleted 98.5% of the time among 200 simulations, and using
a 60% thresholding level, the variables $Z_3$, $Z_4$ and $g(W)$ were simultaneously
deleted 92% of the time. Hence, only variables $Z_1$ and $Z_2$ remain. Their estimated
coefficients are depicted in Figure 3 for a typical sample. 


[Figure 3, panels: (a) Estimated curve of $\beta_1$; (b) Estimated curve of $\beta_2$; horizontal axis $W$.]

FIG. 3. The estimated coefficient functions (dashed lines) using the local partial-likelihood approach
with bandwidth $h = 0.3$ after deleting $Z_3$, $Z_4$ and $g(w)$, as well as true lines (solid lines)
and their 95% confidence bands (dotted lines) for Example 2.


5.2. Data analysis. The proposed approaches are now applied to the nursing 
home data set analyzed by Morris, Norton and Zhou [29], where a full description 
of this data set is given. The data are from an experiment sponsored by the National 
Center for Health Services Research during 1980—1982 that involved 36 for-profit 
nursing homes in San Diego, California, with a sample of size 1601. 

The study was designed to evaluate the effects of different financial incentives 
on, among other things, the duration of stay. This motivated Morris, Norton and 
Zhou [29] to take days T in the nursing home as the response variable. They used 
the model 


$\lambda(t, x) = \lambda_0(t)\exp\Big(\sum_{j=1}^{7} x_j\beta_j\Big)$,

where $x_1$ is a treatment indicator, being 1 if treated at a nursing home and 0 otherwise;
$x_2$ is a gender variable (1 for males and 0 for females); $x_3$ is a marital
status indicator (1 if married and 0 otherwise); $x_4$, $x_5$, $x_6$ are three binary health
status indicators that correspond to the best health to the worst health; $x_7$ is age,
which ranges from 65 to 104. Morris, Norton and Zhou [29] fitted the Cox model


[Figure 4, panels: (a) $\beta$ for gender; (b) $\beta$ for health1; (c) $\beta$ for health2; (d) $\beta$ for health3; (e) $g(\mathrm{age})$; (f) $g'(\mathrm{age})$; horizontal axis age.]

FIG. 4. The estimated coefficient functions (solid lines) via a local partial-likelihood approach with
bandwidth $h = 15$ and their 95% confidence limits (dotted lines) for the nursing home data without
the treatment and marital covariates.


with three parametric and one nonparametric baseline hazard model to this data 
set. Their model does not include any possible interactions between age and other 
variables. To explore possible interaction, Fan and Li [17] added interaction terms 
such as $x_7x_1, x_7x_2, \ldots$ in the initial model. With our newly developed technique,
we can fit the more general model

$\lambda(t, x) = \lambda_0(t)\exp\Big\{\sum_{j=1}^{6}\beta_j(x_7)x_j + g(x_7)\Big\}$.

This permits us to examine how different age groups interact with covariates such 
as treatment, gender and marital status. In fact, as age increases, elderly people 
would expect to stay at nursing homes longer. Therefore, it is natural to introduce 
the term g (x7), the varying intercept. 

The local partial-likelihood method was applied to the data set with bandwidth
$h = 15$, which was chosen by K-fold cross-validation [8, 25] to minimize
the prediction error $\sum_i\int_0^\tau\{N_i(t) - \hat E N_i(t)\}^2\,d\{\sum_{k=1}^n N_k(t)\}$, where $\hat E N_i(t) =
\int_0^t Y_i(u)\exp\{\hat\beta(W_i)^T Z_i(u) + \hat g(W_i)\}\hat\lambda_0(u)\,du$ is the estimate of the expected failure
number up to time $t$. We chose $K = 20$. Here, examination of the resulting estimated
coefficient functions and their 95% confidence bands (not presented here)
suggests that the variables treatment and marital status are not very significant. We
therefore applied the variable selection technique to the data with the regularization
parameter set to 0.02 and bandwidth $h = 15$. The coefficient function for the treatment
effect was estimated as zero at 89.5% of grid points, the coefficient function for marital
status was estimated as zero at 97.9% of grid points and they were simultaneously estimated as
zero at 87.5% of grid points. Thus, the variables treatment indicator and marital
status were deleted. In other words, there is no significant treatment effect even
when the more objective model (less restrictive model than in [29] and [18]) is
used. Applying the local log partial-likelihood method (2.4) to the remaining five
variables, we obtained estimated coefficient functions as in Figure 4 above. These
functions depict the extent to which the gender effect and the health effect vary
with age, and indicate clearly that the risk of staying at a nursing home depends
on age.
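The bandwidth choice by K-fold cross-validation can be outlined as follows. This is a generic skeleton rather than the authors' code: `fit` and `prediction_error` are hypothetical callables (fitting the model on the training indices with bandwidth h, and evaluating the prediction error above on the held-out indices), and the random fold assignment is an assumption.

```python
import numpy as np

def kfold_indices(n, K, seed=0):
    """Randomly split indices 0..n-1 into K nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)

def choose_bandwidth(candidates, n, K, fit, prediction_error, seed=0):
    """Return the candidate bandwidth with the smallest total held-out error."""
    folds = kfold_indices(n, K, seed)
    scores = {}
    for h in candidates:
        total = 0.0
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            model = fit(train_idx, h)                 # hypothetical fitting routine
            total += prediction_error(model, test_idx)
        scores[h] = total
    best = min(scores, key=scores.get)
    return best, scores

# Toy stand-ins (assumptions) so the skeleton can be exercised end to end.
def toy_fit(train_idx, h):
    return h

def toy_error(model, test_idx):
    return len(test_idx) * (model - 15.0) ** 2

best_h, cv_scores = choose_bandwidth([5.0, 15.0, 25.0], 103, 20, toy_fit, toy_error)
```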


APPENDIX A: PROOFS 


A.1. Notation and conditions. For easy reference, we collect a set of notation
and conditions to be used. Let $(\Omega, \mathcal F, P_{(\beta, g, \lambda_0)})$ be a family of complete probability
spaces provided with a history $\mathbb F = \{\mathcal F_t\}$, for an increasing right-continuous
filtration $\mathcal F_t \subset \mathcal F$. We assume that $W_i$ is $\mathcal F_t$-measurable, and $N_i(u)$ and $Z_i(u)$ are
$\mathbb F$-adapted. Write $\mathcal F_t = \sigma\{X_i \le u, Z_i(u), W_i, Y_i(u), i = 1, 2, \ldots, n, 0 \le u \le t\}$ and
$M_i(t) = N_i(t) - \int_0^t \lambda_i(u)\,du$, $i = 1, 2, \ldots, n$. Obviously, $M_i(t)$ is an $\mathcal F_t$ martingale.

Let || - || denote the L2-norm and let || - ||; be the sup-norm of a function or a 
process on a set $J$. The support of the random variable $W$ is denoted by $\mathcal W$. For a
compact subset $J_W$ of $\mathcal W$, we define the neighborhood set $J_{W,\epsilon}$ of $J_W$ as

$J_{W,\epsilon} = \{w : \inf_{w_0 \in J_W}|w - w_0| \le \epsilon\}$

for some $\epsilon > 0$.
To facilitate technical arguments, we will reparametrize the local partial likelihood
(2.4) via the transformation $\zeta = H(\xi - \xi_0)$. Hence, the logarithm of the local
partial-likelihood function is

$\ell_n(\tau, \zeta) = \ell_n(H^{-1}\zeta + \xi_0, \tau)$
$= \frac{1}{n}\sum_{i=1}^n\int_0^\tau K_h(W_i - w_0)[\zeta^T U_i^*(u) + \xi_0^T X_i^*(u) - \log S_{n0}(u, \zeta, w_0)]\,dM_i(u)$
$\quad + \frac{1}{n}\sum_{i=1}^n\int_0^\tau K_h(W_i - w_0)[\zeta^T U_i^*(u) + \xi_0^T X_i^*(u) - \log S_{n0}(u, \zeta, w_0)]$
$\qquad \times Y_i(u)\exp(\beta_0(W_i)^T Z_i(u) + g_0(W_i))\lambda_0(u)\,du$,



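where $U_i^*(u) = H^{-1}X_i^*(u)$ and

$S_{nk}(u, \zeta, w_0) = \sum_{i=1}^n K_h(W_i - w_0)Y_i(u)\exp(\zeta^T U_i^*(u) + \xi_0^T X_i^*(u))(U_i^*(u))^{\otimes k}$, $k = 0, 1, 2$.

Furthermore, for each $u \in [0, \tau]$ and $k = 0, 1, 2$, we write $\ell_n(\zeta) = \ell_n(\tau, \zeta)$ and
define

$S^*_{nk}(u, \theta, w_0) = \sum_{i=1}^n K_h(W_i - w_0)Y_i(u)\exp(\beta^T(W_i)Z_i(u) + g(W_i))(U_i^*(u))^{\otimes k}$,

where $\xi(\cdot) = (\beta^T(\cdot), \beta'(\cdot)^T, g'(\cdot))^T$, $\theta(\cdot) = (\beta^T(\cdot), g(\cdot))^T$ and $w_0 \in J_W$.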
Let f (wo) be the density of the random variable W. In addition to the notation 
introduced before Theorem 2, we also define, for wo € Jwe, 


sé (u, 9, wo) = f (wo) Ele(u, Z(u), wo)|W = wo], 
s*(u, 0, wo) = f (wo) Elo(u, Z(u), wo) (Z7 (u), 0, 0)" |W = wo], 


$5 (u, 8, wo) = f (wo) E E Z(u), wo) exp(B (wo)! Zu) + g(wo)) 


Z(u)Z! (u) 0 0 
x | 0 Z(u)Z? (ujua, 2 w = J 


0 Z! (u)ua, u2 
and 


Sk(u, £, wo) 
= f (wo) J E[P (u, Z(u), wo) V (£, Eg, Z(u), y)Ru(y)™ |W = wo]K (y) dy, 


where k = 0, 1,2, Ry(y) = (Z! (u), ZT (u)y, y)” and 
Z 
vazn =oo(rro (1)). 
0 


To facilitate notation, the arguments @9(w) = (Bi (w), go(w)) , &£g(w), £9 0 
and wo are omitted in $7,(t,0, wo), Sax (f, £, wo), st (t, 0, wo) and s(t, £, wo) 
whenever there is no ambiguity. For example, 


Sa (D = Silt, wo) = S7, (t, 80, wo), SE (£) = sp (t, wo) = sz (t, 00, wo), 
Snk (t) = Snk(t, wo) = Snk (t, 0, wo), Sx (1) m Sk (1, wo) == Sk (t, 0, wo), 
Snk(f, £) = Snk (t, 6, wo), se(t, £) = sk (t, č, wo). 




CONDITION A. 


1. The kernel function $K \ge 0$ is a bounded, symmetric density function with compact
support.

2. The functions $\beta(\cdot)$ and $g(\cdot)$ have continuous second-order derivatives around
the point $w_0$.

3. The density function $f(\cdot)$ of $W$ is continuous at the point $w_0$ and $f(w_0) > 0$.

4. The conditional probability $P(u, Z(u), \cdot)$ is equicontinuous at $w_0$ and the covariate
$Z(u)$ is continuous.

5. We have $nh \to \infty$ and $nh^5$ is bounded.

6. We have $\int_0^\tau \lambda_0(t)\,dt < \infty$.

7. (Lindeberg condition) There exists $\delta > 0$ such that

$(nh)^{-1/2}\sup_{t \in [0,\tau],\, i \in \mathcal N}|Z_i(t)|Y_i(t)I(\beta_0(w_0)^T Z_i(t) > -\delta|Z_i(t)|) \to 0$ in probability,

where $\mathcal N = \{1, 2, \ldots, n\}$.

8. (Asymptotic variance) The matrix $a_2 - \int_0^\tau \frac{a_1(u)a_1(u)^T}{a_0(u)}\,d\Lambda_0(u)$ is positive definite
at the point $w_0$ and the matrix $\begin{pmatrix} a_2 & a_1 \\ a_1^T & a_0 \end{pmatrix}$ is nonsingular at the point $w_0$.
| 


Condition A will be used to derive the pointwise convergence properties of $\hat\xi$
and its asymptotic normality. Conditions A.1–A.5 are similar to those in [16] and
Conditions A.7–A.8 are similar to Conditions C and D of [2]. Condition A.7 seems
complicated, but can be easily verified in some important cases. For example, when
the covariates $Z$ are bounded, the condition is always satisfied; if the covariates $Z$
are bounded by a random variable that has a bounded $r$th moment for some constant
$r > 2$, the condition also holds. Other cases can be found in [2]. To derive the
uniformly consistent result, Condition A needs to be strengthened as follows.


CONDITION B. 


1. The kernel function $K \ge 0$ is a bounded, symmetric density function with compact
support.

2. The functions $\beta_0(\cdot)$ and $g_0(\cdot)$ have continuous second-order derivatives
on $J_{W,\epsilon}$.

3. The conditional probability $P(u, Z(u), w)$ is equicontinuous in the arguments
$(u, w)$ on $[0, \tau] \times J_{W,\epsilon}$.

4. The compact set $J_W \subset \mathcal W$ has the property $\inf_{w \in J_{W,\epsilon}} f(w) > 0$ for some $\epsilon > 0$.

5. The covariate process $Z(u)$ has continuous sample paths in a subset $\mathcal Z$ of the
continuous function space, and $\int_0^\tau \lambda_0(t)\,dt < \infty$ and $\|f_W\|_{J_{W,\epsilon}} < \infty$.

6. The function $s^*_0(t, \theta, w_0)$ is bounded away from 0 on the product space
$[0, \tau] \times C \times J_{W,\epsilon}$, that is,

$\inf_{t \in [0,\tau]}\ \inf_{(\beta^T, g) \in C}\ \inf_{w_0 \in J_{W,\epsilon}} s^*_0(t, \theta, w_0) > 0$




and 


sup sup £(Z(t)\* exp(p? Z(t) +g) « oo, 
tc[0,r] (8^ ,g)eC 


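where $C \subset R^{p+1}$.
7. We have $nh/\log n \to \infty$ and $nh^5$ is bounded.
8. (Asymptotic variance) The matrix $a_2 - \int_0^\tau \frac{a_1(u)a_1(u)^T}{a_0(u)}\,d\Lambda_0(u)$ is positive definite
for any $w_0 \in J_{W,\epsilon}$ and the matrix $\begin{pmatrix} a_2 & a_1 \\ a_1^T & a_0 \end{pmatrix}$ is nonsingular for every $w_0 \in J_{W,\epsilon}$.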


A.2. Proof of main results. Let

$C_n(t) = n^{-1}\sum_{i=1}^n Y_i(t)\,g(W_i, (W_i - w_0)/h, Z_i(t))\,K_h(W_i - w_0)$

for a function $g(\cdot, \cdot, \cdot)$.

LEMMA A.1. Assume that Conditions A.1 and A.4 hold. Suppose that
$g(\cdot, \cdot, \cdot)$ is continuous in its three arguments and that $E(g(W, u, Z(t))|W = w_0)$ is
continuous at the point $w_0$. If $h \to 0$ in such a way that $nh/\log n \to \infty$, then

$\sup_{0 \le t \le \tau}|C_n(t) - C(t)| \to 0$,

where $C(t) = f(w_0)\int E(Y(t)g(w_0, u, Z(t))|W = w_0)K(u)\,du$.

PROOF. It is easy to show that for every $t \in [0, \tau]$,

(A.1) $|C_n(t) - C(t)| \to 0$.

Now we divide $[0, \tau]$ into $M$ subintervals $[t_{i-1}, t_i]$, $i = 1, 2, \ldots, M$, with maximum
length $\delta$. Then

(A.2) $\max_{1 \le i \le M}|C_n(t_i) - C(t_i)| \to 0$.

Note that

(A.3) $\sup_{0 \le t \le \tau}|C_n(t) - C(t)| \le \max_{1 \le i \le M}|C_n(t_{i-1}) - C(t_{i-1})|$
$\qquad + \max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta}|C_n(t) - C(t) - (C_n(t_{i-1}) - C(t_{i-1}))|$.

The first term on the right-hand side is asymptotically negligible. We now deal
with the second term. Write

$g(W, (W - w_0)/h, Z) = g^+(W, (W - w_0)/h, Z) - g^-(W, (W - w_0)/h, Z)$,




where $g^+(\cdot, \cdot, \cdot)$ and $g^-(\cdot, \cdot, \cdot)$ are the positive part and negative part of $g(\cdot, \cdot, \cdot)$,
respectively. Correspondingly, we decompose $C_n(t)$ into $C_n^+(t)$ and $C_n^-(t)$. We
only need to show that

(A.4) $\max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta}|C_n^+(t) - C_n^+(t_{i-1})| + \max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta}|C^+(t) - C^+(t_{i-1})| \to 0$

and a similar result for $C_n^-(t)$. We now focus on (A.4). It will be shown in Appendix B
that

(A.5) $\max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta}|C_n^+(t) - C_n^+(t_{i-1})| \to 0$.

On the other hand, we have

(A.6) $\max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta}|C^+(t) - C^+(t_{i-1})|$
$\quad \le \max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta} f(w_0)\int E[Y(t)|g^+(w_0, u, Z(t)) - g^+(w_0, u, Z(t_{i-1}))|\,|W = w_0]K(u)\,du$
$\qquad + \max_{1 \le i \le M}\ \sup_{|t - t_{i-1}| \le \delta} f(w_0)\int E[I(t_{i-1} \le X < t)g^+(w_0, u, Z(t_{i-1}))|W = w_0]K(u)\,du$,

which tends to zero as $\delta \to 0$. Hence (A.4) holds. This completes the proof. □


LEMMA A.2. Assume that $g(w, u, Z(t))$ is equicontinuous in its arguments
$w$ and $u$, and that $E(g(w_0, u, Z(t))|W = w_0)$ is equicontinuous in the argument
$w_0$. Under Conditions B.3 and B.4, we have

$\sup_{0 \le t \le \tau}\ \sup_{w_0 \in B}|C_n(t, w_0) - C(t, w_0)| \to 0$,

where $B$ is a compact set that satisfies $\inf_{w \in B} f(w) > 0$.

The proof of Lemma A.2 is similar to that of Lemma A.1 and is omitted.


LEMMA A.3. Let $C$ and $D$ be compact subsets of Euclidean spaces, and let $f(x, \theta)$ be
a continuous function in $\theta \in C$ and $x \in D$. Assume that $\theta_0(x)$ is continuous in
$x \in D$ and is the unique maximizer of $f(x, \theta)$. Let $\hat\theta_n(x) \in C$ be a maximizer
of $f_n(x, \theta)$. If

$\sup_{\theta \in C,\, x \in D}|f_n(x, \theta) - f(x, \theta)| \to 0$,

then

$\sup_{x \in D}|\hat\theta_n(x) - \theta_0(x)| \to 0$.

The proof of Lemma A.3 can be found in [11].


LEMMA A.4. Under Condition A, we have for $k = 0, 1, 2$,

$n^{-1}S^*_{nk}(u) = s^*_k(u) + o_p(1)$,

uniformly for $u \in (0, \tau]$, where $s^*_k(u) = s^*_k(u, \theta_0, w_0)$ and

$\sup_{u \in (0, \tau]}\|n^{-1}S^*_{nk}(u, \theta, w_0) - s^*_k(u, \theta, w_0)\| = o_p(1)$,

where $\theta$ lies in a neighborhood of $\theta_0$ for fixed $w_0$. In addition, we have for each $\zeta$,

$\sup_{u \in (0, \tau]}\|n^{-1}S_{nk}(u, \zeta, w_0) - s_k(u, \zeta, w_0)\| = o_p(1)$,

where $\zeta$ lies in a neighborhood of 0 for fixed $w_0$. Furthermore, under Condition B,
we have

$\|n^{-1}S^*_{nk} - s^*_k\|_R = o_p(1)$,

where $R = [0, \tau] \times C \times J_{W,\epsilon}$ and a similar result holds for $S_{nk}(u, \zeta, w_0)$.

The results of Lemma A.4 can be easily proved along similar lines to the arguments
establishing Lemma A.1.


PROOF OF THEOREM 1. The first result of Theorem 1 follows from the first
step in the proof of Theorem 2. Now we only prove the second result of Theorem 1.
By an argument similar to that in the first step in the proof of Theorem 2, we easily
prove from Lemma A.2 that

$\sup_{t \in [0, \tau]}\ \sup_{\zeta \in C^*}\ \sup_{w_0 \in J_W}|\ell_n(t, \zeta) - \ell_n(t, 0) - Y(t, \zeta)| \to 0$

in probability; here $C^*$ is a convex and compact set of $R^{2p+1}$. Therefore, it follows
from Lemma A.3 that $\sup_{w_0 \in J_W}|\hat\zeta| \to 0$ in probability. The proof is complete. □


PROOF OF THEOREM 2. We first prove that $\sqrt{nh}\,H(\hat\xi(w_0) - \xi_0(w_0))$ is asymptotically
normal with mean $h^2 e_p\xi_0''(w_0)\mu_2/2$ and covariance $\Sigma(\tau, w_0)$. Now
we divide the proof of the asymptotic normality of $\sqrt{nh}\,H(\hat\xi(w_0) - \xi_0(w_0))$ into
three steps. The first step is to show that $H(\hat\xi(w_0) - \xi_0(w_0)) \to 0$ in probability.
The second step is to establish the asymptotic normality of the first derivative of
the local partial likelihood. The third step is to demonstrate that the Hessian matrix
of the local partial-likelihood function converges to a positive definite one.
Theorem 2 will then be proved by combining the results in these three steps.

(a) We first show that $\hat\zeta \to 0$ in probability, where $\hat\zeta = H(\hat\xi - \xi_0)$. It is easy to
show that


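$\ell_n(\tau, \zeta) - \ell_n(\tau, 0)$
$= \frac{1}{n}\sum_{i=1}^n\int_0^\tau K_h(W_i - w_0)\Big[\zeta^T U_i^*(u) - \log\frac{S_{n0}(u, \zeta)}{S_{n0}(u, 0)}\Big]dM_i(u)$
$\quad + \frac{1}{n}\sum_{i=1}^n\int_0^\tau K_h(W_i - w_0)\Big[\zeta^T U_i^*(u) - \log\frac{S_{n0}(u, \zeta)}{S_{n0}(u, 0)}\Big]Y_i(u)\exp(\beta_0(W_i)^T Z_i(u) + g_0(W_i))\lambda_0(u)\,du$
$= X_n(\tau, \zeta) + Y_n(\tau, \zeta)$.

By Lemma A.1 we obtain that

$Y_n(\tau, \zeta) = \int_0^\tau (s^*_1(u))^T\zeta\,\lambda_0(u)\,du - \int_0^\tau \log\frac{s_0(u, \zeta)}{s_0(u, 0)}\,s^*_0(u)\lambda_0(u)\,du + o_p(1)$
$= Y(\tau, \zeta) + o_p(1)$.

In Appendix B, we will show that $Y(\tau, \zeta)$ is a strictly concave function in $\zeta$ and
has maximum value at $\zeta = 0$. The process $X_n(t, \zeta)$ is a local square integrable
martingale with the square variation process

$D_n(\tau) = \langle X_n(\cdot, \zeta), X_n(\cdot, \zeta)\rangle(\tau)$
$= \frac{1}{n^2}\sum_{i=1}^n\int_0^\tau K_h^2(W_i - w_0)\Big[\zeta^T U_i^*(u) - \log\frac{S_{n0}(u, \zeta)}{S_{n0}(u, 0)}\Big]^2 Y_i(u)\exp(\beta_0(W_i)^T Z_i(u) + g_0(W_i))\lambda_0(u)\,du$.

It follows from Lemma A.1 that

$E X_n^2(t, \zeta) = E D_n(t) = O((nh)^{-1}) \to 0$, $0 \le t \le \tau$.

Hence, we have that

$\ell_n(\tau, \zeta) - \ell_n(\tau, 0) = Y(\tau, \zeta) + O_p((nh)^{-1/2})$.

Obviously, $\ell_n(\tau, \zeta) - \ell_n(\tau, 0)$ is strictly concave in $\zeta$ with the maximizer $\hat\zeta$. By the
concavity lemma it follows that $\hat\zeta \to 0$, the maximizer of $Y(\tau, \zeta)$, in probability.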




(b) We now show that $\sqrt{nh}(\ell'_n(\tau, 0) - B_n(\tau, w_0))$ is asymptotically normal
with mean zero and covariance $\Pi(\tau, w_0)$, where the definitions of $B_n(\tau, w_0)$ and
$\Pi(\tau, w_0)$ can be found below.

Observe that 


no(u, wo) 


2 z l| $T Sni (t, 
£O =E, 0 - - Y [. KW, - wo)| UF) - cru dM;(u) 
pl 


P S Sn1(u, wo) 
"d RON wo)| Ur Sno(u, d 
x exp(&9(W,)^ Z, (u) + go(W,))Y, (u)Ao(u) du. 


Let us denote the above two terms, respectively, by $I_1(\tau, 0)$ and $I_2(\tau, 0)$. We first
deal with $I_2(\tau, 0)$. By Taylor expansion we have

(A.8) $\exp(\beta_0(W_i)^T Z_i(u) + g_0(W_i)) - \exp(\xi_0^T X_i^* + g_0(w_0))$
$\quad = \frac{1}{2}\exp(\xi_0^T X_i^* + g_0(w_0))[\beta_0''(w_0)^T Z_i(u) + g_0''(w_0)]$
$\qquad \times (W_i - w_0)^2(1 + O_p(h))$.
Note that 


B l n T X + u Sni (u) 
he. - LY [ KW -voh — 975) 


x [exp(Bo(W.)" Z, (u) + go(W.)) — explo X7 + 2o(wo))] 
x Y, (u)Ao(u) du. 


Then it follows from Lemmas A.1 and A.4 that 





h(t, 0) = moh Ky (WW, — wo)| UF) — vad 


x Y,(u) exp(&4 X? + go(wo)) [BG (wo)? Z, (u) + gg (wo)] 
x (W, — wo)^Ao(u) du(1 + op (h)) 


L(u) {2 st 
-IM f (wo) A e| (ss )- | p(u, Zu), wo) 
So 


x [BE (wo)? Z(u) + gp (wo)]|W = wo 


x Ao(u) du(1 + Oy (h)), 


where $s^*_k(u) = s^*_k(u, \theta_0, w_0)$ for $k = 0, 1, 2$. Since $K(\cdot)$ is a symmetric function,
which implies $\mu_3 = 0$, simple algebra shows that

In(t, 0) = 5h” ua f (wo) 


t | uü — 81 MU 
x i E 0 
0 0 


Bo (wo) 
(A.9) xp(u, Z(u),wo(Z^,0,D| 0 d Ao(u) 
g” (wo) 


= 1h? uses T7 Bine (1 + Op). 


Let us denote the term in (A.9) by $B_n(\tau, w_0)$.
We now derive the asymptotic normality of the term $I_1(\tau, 0)$. Let $I_1^*(t) =
\sqrt{nh}\,I_1(t, 0)$. Then





* pk _A : i 2 Sni (4) = 
Qr Y - - I, Ea = wo)| Uta) " = 


x Y, (u) exp(Bo(Wi)! Z, (u) + g0(W;:))ào(u) du. 
By Lemma A.1 and using Conditions A.1 and A.8, it can be shown that 
II(r, wo) 
— HH * px 
= lim E(If, If)(x) 
= f (wo) 


t (Z(u) — a, (u)/ag(u))® vo 0 0 ^ 
(A10) x | E 0 ZZ! (vs 2) 
i 0 ZT (u)v v2» 


x p(u, Z(u), wo)| W = «| d Ao(u) 


Tuy O 0 
= 0 82V? amj. 
0 aiv aor 
By Condition A.7 and a proof similar to that of Andersen and Gill [2], it is easy to
prove that the Lindeberg condition for the process $I_1^*(t)$ holds. By the martingale
central limit theorem, we derive that $I_1^*(\tau)$ is asymptotically normal with mean
zero and covariance $\Pi(\tau, w_0)$. Hence

(A.11) $\sqrt{nh}(\ell'_n(0) - B_n(\tau, w_0)) \to N(0, \Pi(\tau, w_0))$.




(c) We will show that the second derivative of the logarithm of the local partial-likelihood
function converges to a finite constant matrix. Since $\hat\zeta \to 0$ in probability,
by the mean-value theorem we have that

(A.12) $\ell''_n(\hat\zeta) = \ell''_n(0) + o_p(1)$.

Since $s^*_k(u) = s_k(u)\exp(g_0(w_0))$, $k = 0, 1, 2$, from Lemma A.4, we can obtain
> (u)so (u) — st Qu) GT Q2) ^ 
£o f K (Ww, — T sid bas aN, 1). 
(0) Y nf TI (u) + op (1) 
Write $F(u) = P(X \le u, \Delta = 1 \mid W = w_0)$ and its corresponding empirical conditional
measure,

$$F_n(u) = \frac{1}{n} \sum_{i=1}^n K_h(W_i - w_0)\, I(X_i \le u, \Delta_i = 1).$$
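As a rough numerical illustration of a kernel-weighted empirical measure of this kind, the following Python sketch evaluates $F_n(u)$ on a toy sample. The Epanechnikov kernel, the scaling $K_h(x) = K(x/h)/h$ and all function and variable names are assumptions made for illustration; they are not taken from the paper.

```python
def epanechnikov(x):
    """Epanechnikov kernel K(x) = 0.75 * (1 - x^2) on [-1, 1] (an assumed choice of K)."""
    return 0.75 * (1.0 - x * x) if abs(x) <= 1.0 else 0.0


def empirical_measure(u, times, events, covariates, w0, h):
    """Kernel-weighted empirical measure F_n(u) around the covariate value w0.

    times[i] is the observed time X_i, events[i] the indicator Delta_i and
    covariates[i] the variable W_i. Only uncensored observations with
    X_i <= u contribute, each weighted by K_h(W_i - w0).
    """
    n = len(times)
    total = 0.0
    for x_i, d_i, w_i in zip(times, events, covariates):
        if x_i <= u and d_i == 1:
            total += epanechnikov((w_i - w0) / h) / h   # K_h(W_i - w0)
    return total / n


# Toy data (hypothetical): three subjects, bandwidth h = 1, target point w0 = 0.
value = empirical_measure(1.5, [1.0, 2.0, 0.5], [1, 0, 1], [0.0, 0.1, 2.0], 0.0, 1.0)
```

In this toy sample only the first subject contributes: the second is censored and the third lies outside the kernel window.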


By kernel smoothing techniques, we easily prove that 


T of * NEM. * T n 
£o - - | $5 (u)so (u) ; 51 MU (u)) dF, (u) + op (1) 
(so (4)) 
(A.13) 
= —A(T, wo) + op(1), 
where 


n AT 


It is easy to show that $A(\tau, w_0)$ is positive definite.

(d) Combining the results in steps (a), (b) and (c), we can establish the asymptotic
normality of $\sqrt{nh}\, H(\hat\xi(w_0) - \xi_0(w_0))$. In fact, since $\hat\zeta$ maximizes $\ell_n(\zeta)$, by
Taylor expansion around 0, we have

$$-\ell_n'(0) = \ell_n'(\hat\zeta) - \ell_n'(0) = \ell_n''(\zeta^*)\, \hat\zeta,$$

where $\zeta^*$ lies between 0 and $\hat\zeta$. Hence $\zeta^* \to 0$ in probability. It follows
from (A.13) that

$$\hat\zeta - A^{-1}(\tau, w_0) B_n(\tau, w_0) = -(\ell_n''(\zeta^*))^{-1} (\ell_n'(0) - B_n(\tau, w_0)) + o_p(1).$$

Combining (A.11) with (A.13), by Slutsky's theorem we obtain that

$$\sqrt{nh}\,(\hat\zeta - A^{-1}(\tau, w_0) B_n(\tau, w_0)) \xrightarrow{D} N(0, A^{-1}(\tau, w_0)\, \Pi(\tau, w_0)\, (A^{-1}(\tau, w_0))^T).$$


dFy(u). 




Now we simplify the matrix $A(\tau, w_0)$. Obviously, by a simple calculation we
have


Z(u)Z! (u) 0 0 
s3 (u) = f(wo)E ( 0 Z(u)Z! (u)ua 23 
0 Z’ (u) uo H2 
(A.14) x p(u, Z(u), wo)|W = | 


82 (u) 0 0 
= ( 0 — a (wu T 
0  aj(w)uo ao(u)u2 
similarly, we obtain that 


aj(u)a! (u) 0 0 
(A.15) T 0 0 o). 
0 0 0 


Note that sg (u) = ao (u). Hence it follows from (A.14) and (A.15) that 


y! 0 0 
(A.16) A(t, wo =] 0 ay aiu |. 
0 alu am 


Hence, the asymptotic bias of the estimator $\hat\xi(w_0)$ is
$b(\tau, w_0) = A^{-1}(\tau, w_0) B_n(\tau, w_0)$
= h*e,&5 (wo) 2/2 
and the asymptotic covariance is 


$\Sigma(\tau, w_0) = A^{-1}(\tau, w_0)\, \Pi(\tau, w_0)\, (A^{-1}(\tau, w_0))^T$


=! or a2 81|  -2 
° & E qr ne 
=( y 0 ) 
«X97 Qu;^wJ: 
This completes the proof. □


PROOF OF THEOREM 3. We have shown from (A.13) that

$$\ell_n''(\zeta^*) = -A(\tau, w_0) + o_p(1) \tag{A.17}$$

for any $\zeta^*$ between zero and $\hat\zeta = H(\hat\xi - \xi_0)$. By Theorem 2, $\hat\zeta = O_p(h^2 + (nh)^{-1/2})$.
Thus, for any $\zeta^* = O_p(h^2 + (nh)^{-1/2})$, (A.17) holds.


By Taylor expansion of £^ (£9) at ¿o = 0, we have 
(A.18) En Êo) = E Eo) + EG Eo — So), 
where ĉo = HE, — £o) and a = HÊ; — 6g), in which EO lies between & and Êo. 
By definition of the one-step estimator and (A.18), we have that 
Eos — $0 = bo — $0) — C569) En Eo). 
Using (A.17), we have 
bos — £o = (7 — G5 G9) EE) Eo — $0) — Er EoD E Co) 
= (Ep Eo En Eo) + op Êo — £o) 
= —(5 EoD) En So) — Bav, wo)] — £n (Eo) Bav, wo) 
+ op((nh)- V? +h’). 
It follows from (A.11) and (A.13) that the one-step estimator has the same asymptotic
distribution as the maximum local partial-likelihood estimator. This yields Theorem 3. □


PROOF OF THEOREM 4. By the same argument as that of Lemma A.1, we 
have 


(A.19) sup sup nA, (t,0) — Arlt, 09)| —> 0 
FEL9,7] 189—861 <|16—-8o| 


in probability, where 


$$A_n(t, \theta) = \sum_{i=1}^n I(W_i \in J_w)\, Y_i(t) \exp(\beta^T(W_i) Z_i(t) + g(W_i)),$$

where $\theta = (\beta^T(\cdot), g(\cdot))^T$.
By definition of Ag(t), we have 


A tr] 1 : ty dN, 
Aa — ^07 | [5 7 2,85] 4^ * | {acy 4^9] 
" [ An (Â) — An(O0) a f An (Ô) — An (O0) |. 
0 AÊ) 0 An(6)An(60) 


d 1 
fi toa, 
0 Agn(00) 


where $\bar N = \sum_{i=1}^n N_i$ and $\bar M = \sum_{i=1}^n M_i$. From (A.19) it is easy to see that the
first term converges to zero in probability uniformly on $(0, \tau]$ as $n \to \infty$. The last
two terms of the above expression are square integrable local martingales with
variation processes


t (A5 (0) — An (00))? t 4 
dg and ied Ka. 
i (AO ASQ O T J, A&(80) 


respectively. Since $A_n(\theta_0) = O_p(n)$, the above variation processes converge to
zero in probability uniformly on $(0, \tau]$ as $n \to \infty$. The terms converge to zero in
probability uniformly on $(0, \tau]$ by an argument similar to that of Andersen and
Gill [2] via the Lenglart inequality. Therefore


$$\hat\Lambda_0(t) \to \Lambda_0(t)$$

uniformly on $(0, \tau]$. Thus, we can prove by the standard argument of kernel
estimation that

$$\hat\lambda_0(t) \to \lambda_0(t)$$

uniformly on $(0, \tau]$. □

PROOF OF THEOREM 5. From the proof of Theorem 2, we easily show that
this theorem holds. □

PROOF OF THEOREM 6. Using the same proof as in Theorem 2, we can get

$$\ell_n'(\xi_0) = O_p((nh)^{-1/2} + h^2).$$

Let $\alpha_n = (nh)^{-1/2} + h^2 + a_n$. Following the same lines as the proof of Theorem 1
of [17], the result follows. □


LEMMA A.5. Suppose that the conditions of Theorem 6 hold. Then with probability
tending to 1, for any given $\xi_1$ satisfying $\|\xi_1 - \xi_{10}\| = O_p((nh)^{-1/2} + h^2)$
and any constant $C$, we have

$$Q((\xi_1^T, 0)^T) = \max_{\|\xi_2\| \le C((nh)^{-1/2} + h^2)} Q((\xi_1^T, \xi_2^T)^T).$$

PROOF. From an argument similar to that in step (b) in the proof of Theorem 2,
it is easy to show that
$$\ell_n'(\xi_0) = O_p((nh)^{-1/2} + h^2),$$
and by an argument similar to that in step (c) of the proof of Theorem 2, we have
$$\ell_n''(\xi_0) = O_p(1).$$
The result follows from the proof of Lemma 1 of [17]. □


PROOF OF THEOREM 7. It follows from Lemma A.5 and Theorem 6 that the
first result of Theorem 7 holds. Now we prove the second result of Theorem 7. It
can be easily shown that there exists a $\hat\xi_1$ as in Theorem 6 that is a local maximizer
of $Q((\xi_1^T, 0)^T)$ and that satisfies the likelihood equations

$$\frac{\partial Q(\xi)}{\partial \xi_1}\Big|_{\xi = (\hat\xi_1^T, 0)^T} = 0.$$

Using the Taylor expansion of $\partial Q(\xi)/\partial \xi_1$ at point $\xi_0$ and noting that $\hat\xi_1$ is a
consistent estimator from Theorem 6, we have


85, o) | (9?1,(59) T" 
7» rr to(0)6 2) 


-b — (Xi--o5 (0), — £10) — 0. 
From the proof of Theorem 2, it is easy to show that 





(A.20) 


Oln I —1 gif 
Vai (n^ zl zh aep lg (wo)(1 +. os) — N (0, I(t, wo)) 


and 
zz] 0^ £, (Eo) 


H~! — -A T, Wo). 
ag ag? ( 0) 





Thus, we have 


1 9€n(—g) 1,4 T ifo (wo) 
vnb (H; T oh m( i ) (1+ op) 





(A.21) 
N (0, II; (r, wo)) 
and 
18 ln 9) 1 
A.22 LEHT! — —A1(r, wo). 
(A.22) H; ag ae! 1(T, wo) 


By some simple calculations, we easily show that the second result of Theorem 7
follows from (A.20), (A.21) and (A.22). □


APPENDIX B 


Concavity and maxima of $Y(\tau, \zeta)$. Here we prove that $Y(\tau, \zeta)$ defined by (A.7)
is concave with respect to $\zeta$. Differentiating the function $Y(\tau, \zeta)$ with respect to $\zeta$,
we have
Y t t , x 
Q (t, t£) = | s* (u)Ao(u) du o [ S1 (u a (u)Ào(u) du, 








ac solu, £) 
92 Y (z, c) t s;(u, £)so(u, ¢) — (si(u, £))8?. , 
EDD ems - f e G — — 10 (u)Ao(u) du. 


By the integral transform and the fact that $aa^T + bb^T \ge 2ab^T$ for any vectors $a$
and $b$, we can show that

$$\frac{\partial^2 Y(\tau, \zeta)}{\partial \zeta^2} \le 0.$$

Again by $s_k^*(u, 0) = s_k(u, \theta_0) \exp(g_0(w_0))$, $k = 0, 1$, we have

$$\frac{\partial Y(\tau, 0)}{\partial \zeta} = 0.$$

Hence $\zeta = 0$ is the maximizer of $Y(\tau, \zeta)$.


PROOF OF (A.5). It is easy to show that
max sup |C;(0)-— Cj (&-)l € +A, 


Ixi «M It—t,—1| «à 


where 


h= max sup |n!) Y,QOg* (W,, QV, — wo)/ h, Z,()Kn(W, — wo) 


ISSM terale] jay 


— n7! > Y, (02g* (W,, (W; — wo)/ h, Z, (ti-1)) 
yes] 


X K&(W, — wo) 
and 


n 
h= max sup |n ^! 'Y,(Dg*(W;, (W, — wo)/ h, Z, (4-1) 
ISSM tenas] 1-1 


x K (W, — wo) 


— n! Y Y, (ti—1)8(W;, (W, — wo)/h, Zj(ti-1)) 


i=] 
x K4(W, — wo)|. 
Note that $Z_j(t)$ ($j = 1, 2, \ldots, n$) is continuous on $[0, \tau]$. Thus we easily obtain
that
Jj x max sup sup |g*(W,, (W, — wo)/ h, Z, (t) 


ISJ En te[0,] |t—1,_1|<8 


Tm £g (W,, (Wj Nen wo)/ h, Z, (t—1)) 


n 
x sup n^! YY, ()Ka(W, — wo), 
te[0,t] j=l 




which tends to zero in probability. Since $Y_j(t)$ is a decreasing function of $t$, we
have, for any $\epsilon > 0$,


n 
$a Xj « fi) 


P(Jp>6)<MP (^ 
j=! 





x gt (W,, (W; — wo)/h, Z)) K&(W, — wo) 





>e); 


n 
n $ ICi- < Xj < ti)g (Wj, QV, — wo)/h, Zj(4—-1)) Kn(W, — wo) 
j=l 


It is easy to show that 


Ey (wo) | E(I(t 1 < X «&)g (wo, u, Z;j(ti 1W = wo)}K (u) du. 
On the other hand, 
E(I(ti 1 < X < t)g^ (wo, u, Z(t, )))W = wo) 
< EPI (4 < X <t,)|W = wo) 
x E^ [g^ (wo, u, Z(t, ))|W = wo] 
= |P(X <t)-1|W = wo) — P(X < |W = wo)| ^ 
x EV (gt? (wo, u, Z(t,-1))|W = wo) 


<E 


as $|t_i - t_{i-1}| < \delta$. Hence
n 
Y l(ü-1« X, «t) 


p(n- 
j=l 
<P( 





x g* (Wj, (Wj — wo)/ h, Z (t, -))) Ka (W; — wo) 





>e) 


x gt (W,, (W, — wo)/h, Zi (& 3) KAu(W, — wo) 


n 
Yoa ra < Xj «u) 
j=l 





— f (wo) J E(Htj-1 « X «1) 


x gt (wo, u, Z(t,-1))|W = wo)K (u) du 


> sn) 





EE P( fto» f1E G «X <t) 


x gt (wo, u, Z(tj .1))|W = wo)K (u) du| > 2) 
<n. 
Hence for any $\eta > 0$ and $\epsilon > 0$ there exists $N_0$ such that for $n > N_0$ we have
$$P(J_1 + J_2 > \epsilon) < 2\eta. \tag{B.1}$$
Therefore, we obtain that


$$P\Big(\max_{1 \le i \le M} \sup_{|t - t_{i-1}| < \delta} |C_n(t) - C_n(t_{i-1})| > \epsilon\Big) < 2\eta. \tag{B.2}$$

This completes the proof of (A.5). □


Acknowledgment. The authors are grateful to the reviewers for constructive 
comments that have significantly improved the presentation of the paper. 


REFERENCES 


[1] ANDERSEN, P. K., BORGAN, Ø., GILL, R. D. and KEIDING, N. (1993). Statistical Models
Based on Counting Processes. Springer, New York. MR1198884

[2] ANDERSEN, P. K. and GILL, R. D. (1982). Cox's regression model for counting processes:
A large sample study. Ann. Statist. 10 1100-1120. MR0673646

[3] ANTONIADIS, A. and FAN, J. (2001). Regularization of wavelet approximations (with
discussion). J. Amer. Statist. Assoc. 96 939-967. MR1946364

[4] BJERVE, S., DOKSUM, K. A. and YANDELL, B. S. (1985). Uniform confidence bounds for
regression based on a simple moving average. Scand. J. Statist. 12 159-169. MR0808152

[5] BREIMAN, L. (1995). Better subset regression using the nonnegative garrote. Technometrics
37 373-384. MR1365720

[6] BRESLOW, N. (1972). Discussion of “Regression models and life-tables,” by D. R. Cox. J. Roy.
Statist. Soc. Ser. B 34 216-217.

[7] BRUMBACK, B. and RICE, J. A. (1998). Smoothing spline models for the analysis of nested
and crossed samples of curves (with discussion). J. Amer. Statist. Assoc. 93 961-994.
MR1649194

[8] CAI, Z., FAN, J. and LI, R. (2000). Efficient estimation and inferences for varying-coefficient
models. J. Amer. Statist. Assoc. 95 888-902. MR1804446

[9] CAI, Z., FAN, J. and YAO, Q. (2000). Functional-coefficient regression models for nonlinear
time series. J. Amer. Statist. Assoc. 95 941-956. MR1804449

[10] CAI, Z. and SUN, Y. (2003). Local linear estimation for time-dependent coefficients in Cox's
regression models. Scand. J. Statist. 30 93-111. MR1963895

[11] CARROLL, R. J., FAN, J., GIJBELS, I. and WAND, M. P. (1997). Generalized partially linear
single-index models. J. Amer. Statist. Assoc. 92 477-489. MR1467842

[12] CARROLL, R. J., RUPPERT, D. and WELSH, A. H. (1998). Local estimating equations.
J. Amer. Statist. Assoc. 93 214-227. MR1614624

[13] CHEN, R. and TSAY, R. S. (1993). Functional-coefficient autoregressive models. J. Amer.
Statist. Assoc. 88 298-308. MR1212492




[14] CLEVELAND, W. S., GROSSE, E. and SHYU, W. M. (1992). Local regression models. In
Statistical Models in S (J. M. Chambers and T. J. Hastie, eds.) 309-376. Wadsworth and
Brooks/Cole, Pacific Grove, CA.

[15] FAN, J. and CHEN, J. (1999). One-step local quasi-likelihood estimation. J. R. Stat. Soc. Ser. B
Stat. Methodol. 61 927-943. MR1722248

[16] FAN, J., GIJBELS, I. and KING, M. (1997). Local likelihood and local partial likelihood in
hazard regression. Ann. Statist. 25 1661-1690. MR1463569

[17] FAN, J. and LI, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle
properties. J. Amer. Statist. Assoc. 96 1348-1360. MR1946581

[18] FAN, J. and LI, R. (2002). Variable selection for Cox's proportional hazards model and frailty
model. Ann. Statist. 30 74-99. MR1892656

[19] FAN, J. and ZHANG, W. (1999). Statistical estimation in varying coefficient models. Ann.
Statist. 27 1491-1518. MR1742497

[20] FLEMING, T. R. and HARRINGTON, D. P. (1991). Counting Processes and Survival Analysis.
Wiley, New York. MR1100924

[21] GAMERMAN, D. (1998). Markov chain Monte Carlo for dynamic generalized linear models.
Biometrika 85 215-227. MR1627273

[22] HÄRDLE, W. (1989). Asymptotic maximal deviation of M-smoothers. J. Multivariate Anal. 29
163-179. MR1004333

[23] HASTIE, T. and TIBSHIRANI, R. (1990). Exploring the nature of covariate effects in the
proportional hazards model. Biometrics 46 1005-1016.

[24] HASTIE, T. J. and TIBSHIRANI, R. J. (1993). Varying-coefficient models (with discussion).
J. Roy. Statist. Soc. Ser. B 55 757-796. MR1229881

[25] HOOVER, D. R., RICE, J. A., WU, C. O. and YANG, L.-P. (1998). Nonparametric smoothing
estimates of time-varying coefficient models with longitudinal data. Biometrika 85
809-822. MR1666699

[26] JOHNSTON, G. J. (1982). Probabilities of maximal deviations for nonparametric regression
function estimates. J. Multivariate Anal. 12 402-414. MR0666014

[27] MARTINUSSEN, T., SCHEIKE, T. H. and SKOVGAARD, I. M. (2000). Efficient estimation of
fixed and time-varying covariate effects in multiplicative intensity models. Unpublished
manuscript.

[28] MARZEC, L. and MARZEC, P. (1997). On fitting Cox's regression model with time-dependent
coefficients. Biometrika 84 901-908. MR1624984

[29] MORRIS, C. N., NORTON, E. C. and ZHOU, X. H. (1994). Parametric duration analysis
of nursing home usage. In Case Studies in Biometry (N. Lange, L. Ryan, L. Billard,
D. Brillinger, L. Conquest and J. Greenhouse, eds.) 231-248. Wiley, New York.

[30] MURPHY, S. A. (1993). Testing for a time dependent coefficient in Cox's regression model.
Scand. J. Statist. 20 35-50. MR1221960

[31] MURPHY, S. A. and SEN, P. K. (1991). Time-dependent coefficients in a Cox-type regression
model. Stochastic Process. Appl. 39 153-180. MR1135092

[32] TIAN, L., ZUCKER, D. and WEI, L. J. (2002). On the Cox model with time-varying regression
coefficients. Working paper, Dept. Biostatistics, Harvard Univ.

[33] TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc.
Ser. B 58 267-288. MR1379242

[34] TIBSHIRANI, R. J. (1997). The lasso method for variable selection in the Cox model. Statistics
in Medicine 16 385-395.

[35] WU, C. O. and CHIANG, C.-T. (2000). Kernel smoothing on varying coefficient models with
longitudinal dependent variable. Statist. Sinica 10 433-456. MR1769751

[36] WU, C. O., CHIANG, C.-T. and HOOVER, D. R. (1998). Asymptotic confidence regions for
kernel smoothing of a varying-coefficient model with longitudinal data. J. Amer. Statist.
Assoc. 93 1388-1402. MR1666635




[37] ZUCKER, D. M. and KARR, A. F. (1990). Nonparametric survival analysis with time-dependent
covariate effects: A penalized partial likelihood approach. Ann. Statist. 18
329-353. MR1041396


J. FAN
DEPARTMENT OF STATISTICS
CHINESE UNIVERSITY OF HONG KONG
SHATIN
HONG KONG
AND
DEPARTMENT OF OPERATIONS RESEARCH
AND FINANCIAL ENGINEERING
PRINCETON UNIVERSITY
PRINCETON, NEW JERSEY 08540
USA
E-MAIL: jqfan@princeton.edu

H. LIN
SCHOOL OF MATHEMATICS
SICHUAN UNIVERSITY
CHENGDU, SICHUAN 610064
PEOPLE'S REPUBLIC OF CHINA

Y. ZHOU
INSTITUTE OF APPLIED MATHEMATICS
ACADEMY OF MATHEMATICS AND SYSTEM SCIENCE
CHINESE ACADEMY OF SCIENCE
ZHONGGUANCUN, BEIJING 100080
PEOPLE'S REPUBLIC OF CHINA
E-MAIL: yzhou@amss.ac.cn


The Annals of Statistics
2006, Vol. 34, No. 1, 326-349
DOI: 10.1214/009053605000000787
© Institute of Mathematical Statistics, 2006


ADAPTIVE MULTISCALE DETECTION OF FILAMENTARY
STRUCTURES IN A BACKGROUND OF UNIFORM
RANDOM POINTS¹


BY ERY ARIAS-CASTRO, DAVID L. DONOHO AND XIAOMING HUO 


University of California, San Diego, Stanford University and 
Georgia Institute of Technology 


We are given a set of $n$ points that might be uniformly distributed in
the unit square $[0, 1]^2$. We wish to test whether the set, although mostly
consisting of uniformly scattered points, also contains a small fraction of
points sampled from some (a priori unknown) curve with $C^\alpha$-norm bounded
by $\beta$. An asymptotic detection threshold exists in this problem; for a constant
$T_-(\alpha, \beta) > 0$, if the number of points sampled from the curve is smaller than
$T_-(\alpha, \beta) n^{1/(1+\alpha)}$, reliable detection is not possible for large $n$. We describe
a multiscale significant-runs algorithm that can reliably detect concentration
of data near a smooth curve, without knowing the smoothness information
$\alpha$ or $\beta$ in advance, provided that the number of points on the curve exceeds
$T_*(\alpha, \beta) n^{1/(1+\alpha)}$. This algorithm therefore has an optimal detection threshold,
up to a factor $T_*/T_-$.

At the heart of our approach is an analysis of the data by counting membership
in multiscale multianisotropic strips. The strips will have area $2/n$
and exhibit a variety of lengths, orientations and anisotropies. The strips are
partitioned into anisotropy classes; each class is organized as a directed graph
whose vertices all are strips of the same anisotropy and whose edges link
such strips to their “good continuations.” The point-cloud data are reduced to
counts that measure membership in strips. Each anisotropy graph is reduced
to a subgraph that consists of strips with significant counts. The algorithm
rejects $H_0$ whenever some such subgraph contains a path that connects many
consecutive significant counts.


1. Introduction. 


We cannot help but see faces and castles in clouds, monsters in ink-blots and exotic 
forms in random dots. Form is so central to human perception that, I am told, it is 
extremely difficult to prove something random or formless. Mae-Wan Ho [20]. 


Suppose we have $n$ data points $X_i \in [0, 1]^2$ which at first glance seem uniformly
distributed in the unit square. On cursory visual inspection, it seems that a suspiciously
large number of the data points fall along a smooth curve. However, the
curve on which these points lie has only been identified after inspection of the
data. We know that the human visual system has the ability to “hallucinate” curvilinear
structure in truly random point clouds. We are therefore concerned about the
reliability of the perceived pattern and wish to follow an objective procedure for
testing the existence of filamentary structure: this procedure should reliably separate
filamentary structure from random scatter and be computationally tractable.

Received July 2003; revised March 2005.
¹Supported in part by NSF Grants DMS-00-77261, DMS-01-40587 (FRG) and DMS-05-05303,
ONR MURI, and a contract from DARPA ACMP.
AMS 2000 subject classifications. Primary 62M30; secondary 62G10, 62G20.
Key words and phrases. Multiscale geometric analysis, pattern recognition, good continuation,
Erdős-Rényi laws, runs test, beamlets.

This is a prototype for various practical imaging problems that range from sur- 
veillance to road and streambed tracking to particle physics [1, 31, 34]. In all cases, 
the observer is looking for evidence of a filamentary structure in a background of 
heavy clutter. 

As a first attempt to formalize matters, consider the problem of testing

$$H_0 : X_i \sim \mathrm{Uniform}(0, 1)^2,$$
versus
$$H_1(\alpha, \beta) : X_i \sim (1 - \varepsilon_n)\, \mathrm{Uniform}(0, 1)^2 + \varepsilon_n\, \mathrm{Uniform}(\mathrm{graph}(f)),$$
where $f \in$ Hölder$(\alpha, \beta)$ is unknown. Here, for $1 < \alpha \le 2$, Hölder$(\alpha, \beta)$ is the class
of functions $g : [0, 1] \to [0, 1]$ with continuous derivative $g'$ that obeys
$|g'(x) - g'(y)| \le \alpha\beta |x - y|^{\alpha - 1}$. In words, we believe that a relatively small fraction $\varepsilon_n$ of
points lie on a smooth curve in the plane.
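To make the two hypotheses concrete, here is a minimal Python sketch that simulates a point cloud from the mixture model. The function names, the example curve and the way a point is placed on the graph (x uniform on [0, 1], y = f(x), which is uniform in x rather than in arc length) are assumptions chosen for illustration, not part of the paper.

```python
import random


def sample_point_cloud(n, eps, f, rng):
    """Draw n points from (1 - eps) * Uniform([0,1]^2) + eps * Uniform(graph(f)).

    Assumption: a point "on the graph" is drawn by taking x uniform on [0, 1]
    and setting y = f(x), which is uniform in x rather than in arc length.
    """
    points = []
    for _ in range(n):
        if rng.random() < eps:
            x = rng.random()
            points.append((x, f(x)))                       # point on the curve
        else:
            points.append((rng.random(), rng.random()))    # background clutter
    return points


def example_curve(x):
    """A hypothetical smooth curve with values in [0, 1]."""
    return 0.5 + 0.25 * (x - 0.5) ** 2


rng = random.Random(0)
null_cloud = sample_point_cloud(1000, 0.0, example_curve, rng)    # H_0
alt_cloud = sample_point_cloud(1000, 0.05, example_curve, rng)    # H_1 with eps_n = 1/20
```

With eps = 0 the sample is pure clutter; with eps = 0.05 roughly 50 of the 1000 points are expected to lie on the curve.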


1.1. “Connect the dots.” In our previous work [5], it was shown that when
$\alpha$ and $\beta$ are fixed and known, there is a detector based on the principle that, under
$H_0$, no Hölder$(\alpha, \beta)$ curve can pass through a very large number of points
in a random point cloud. More particularly, we know that there is a threshold
$T_+ = T_+(\alpha, \beta)$ such that:

• If $T < T_+$, we have that, with probability tending to 1, there exists a
Hölder$(\alpha, \beta)$ curve that contains at least $T \cdot n^{1/(1+\alpha)}$ points (out of $n$).

• If $T > T_+$, we have that, with probability tending to 1, there does not exist a
Hölder$(\alpha, \beta)$ curve that contains more than $T \cdot n^{1/(1+\alpha)}$ points $(X_i)_{i=1}^n$.

(More concretely, if we deal with Lipschitz curves with $|\text{slope}| \le 1$, we have found
empirically that for moderate $n \sim 1000$, there will frequently be some Lipschitz
curve that contains $\sqrt{n}$ data points, but rarely will there be one that contains more
than $3\sqrt{n}$ points.) So, if we happen to notice a curve passing through substantially
more than $T_+ \cdot n^{1/(1+\alpha)}$ points, we have a strong basis to reject the null hypothesis
of pure randomness. Moreover, to within a constant factor this threshold is optimal;
no sequence of tests can be reliable for detecting substantially fewer than
$T_- \cdot n^{1/(1+\alpha)}$ points, for a certain $T_- > 0$.

Elaborating this connect-the-dots (CTD) principle leads to a formal hypothesis
test based on searching for curves that contain large numbers of points. Let
$N_n(A) = \#(\{X_i\} \cap A)$ denote the measure that counts how many points lie in the
set $A$. Searching for a curve that contains the maximal number of points leads to
the optimization problem

$$N_n^*(\alpha, \beta) = \max\{N_n(\mathrm{graph}(f)) : f \in \text{Hölder}(\alpha, \beta)\},$$

which searches over all Hölder$(\alpha, \beta)$ graphs and rejects $H_0$ for values of $N_n^*$ that
substantially exceed $T_+ \cdot n^{1/(1+\alpha)}$.

The CTD approach, while very instructive, does not address the concerns of 
someone who might actually be interested in performing a test on real data. Such 
concerns include: 


• Computational burden. The task of finding the largest number of points on
a Hölder$(\alpha, \beta)$ curve seems to us to be computationally impractical unless
$\alpha \in \{1, 2\}$.

• Unknown $\alpha$, $\beta$. The CTD approach assumes a specific $(\alpha, \beta)$ combination. Instead,
we desire an algorithm that works regardless of the specific values of
$\alpha \in (1, 2]$ and $\beta > 0$.

• Fragments. The CTD approach searches only for graphs that extend all the way
across the square from $x = 0$ to $x = 1$. Instead, one wants an algorithm that
works even for short graphs.

• General planar curve. The CTD approach assumes that the underlying curve
can be parametrized as a graph. It seems important to search for general curves
rather than just graphs, for example, curves that loop around in the plane.


1.2. An adaptive multiscale approach. In this paper we describe an approach 
that addresses the concerns just listed. Our proposal: 


1. Works across a wide range of $(\alpha, \beta)$, and only requires knowing a bound on
the maximum slope of the curve.
2. Detects the presence of $H_1$ provided the number of points on the curve exceeds
$T_* \cdot n^{1/(1+\alpha)}$, for a constant $T_*$ which depends on $\alpha$, $\beta$ and other factors. In view of earlier
results, this is optimally sensitive to within a factor $T_*/T_-$.
3. Runs in $O(n^2 \cdot \log(n))$ flops.
4. Extends naturally to detect general planar curves that are not graphs.
5. Extends naturally to detect target filaments of unknown extent that, in large
samples, can be very short compared to the image extent.


The detector is based on a kind of multiscale geometric analysis of the data 
set, using a multiscale dictionary of parallelogram strips that exhibit a variety of 
lengths, locations, orientations and aspect ratios. The idea is to count membership 
of data points in various strips, to identify strips with significantly large counts and 
to search for long runs of significantly large counts in collections of strips that are 
“good continuations” of each other. 




The detector is adaptive to the unknown smoothness $(\alpha, \beta)$ in the sense that it
achieves near-optimal performance over a wide range: $1 < \alpha \le 2$, $\beta > 0$. (This
notion of adaptivity parallels the notion of adaptive near-minimaxity in nonparametric
smoothing, in which a single estimator, able to perform in a near-minimax
way across a whole range of different smoothness conditions, is called adaptive
to unknown smoothness [17].) Ultimately, such adaptivity flows from ideas behind
Lemma 2.2 below, which show that our class of strips has certain covering properties
uniformly over each smoothness class in the range $1 < \alpha \le 2$.

An interesting aspect of our approach is how simply and naturally the principle 
of good continuation appears and leads to a solution. 

Note that here we consider only the presence of the underlying curve—a detec- 
tion problem. Another question—the estimation problem—is to locate the position 
of the curve accurately. The performance of our procedure for estimation will not 
be addressed here. 


1.3. Contents. Section 2 describes our underlying multiscale data structures. 
Section 3 describes our adaptive algorithm in general terms and gives a statement 
of our main result. Section 4 describes the threshold settings that underlie our algorithm,
while Sections 5 and 6 analyze its behavior under $H_0$ and $H_1$, respectively.
Section 7 finishes the paper with a discussion of related work. 


2. Multiscale anisotropic strips and good continuation. Our data structures
comprise a multiscale collection of anisotropic, tilted planar regions and a sequence
of directed graphs that organize them. We use ideas and notation common
in dyadic multiscale analysis (e.g., dyadic partitioning) [10, 14-17, 24]; in particular,
we assume that $n$ is large and find it convenient to let $J = [\log_2(n)]$ denote
its dyadic logarithm. The variable $j$ will index dyadic scales $2^{-j}$ and will range
through $0 \le j \le J$.

In our construction we fix in advance $S > 1$ (e.g., 2 or 4); this controls the
maximum |slope| we will be able to detect.

Let $R(j, k, \ell_1, \ell_2)$ be a parallelepiped with vertical sides that is $w = 2^{-j}$ wide
by $t = 2^{-(J-j)+1}$ thick. Here $j$ runs through our set of scale indices $\{0, \ldots, J\}$.
For examples, see Figure 1. The regions in question have a midline that bisects
them vertically and will be tilted (sheared) at a variety of angles. Notice that these
regions are highly anisotropic. While the whole collection implicitly depends on $n$,
we suppress this in our notation. Moreover, the width $w$ and thickness $t$ depend
on $j$ and $n$, but we also suppress this in our notation. Note that the degree of
anisotropy is the same for all regions that share a common value of $j$; we generally
focus only on one anisotropy class $j$ at a time.

FIG. 1. An anisotropic strip $R$. [Labels in the figure: thickness $t$, center $c$, slope $s$, width $w$.]

The parameters $k$ and $\ell_i$, $i = 1, 2$, control the horizontal location of the regions
and the vertical location and slope of the midline. There is an underlying assumption
that we are interested only in regions whose major axis has a slope bounded in
absolute value by $S$. The mapping between these discrete parameters is intended
to ensure that the regions pack together horizontally and that they are fairly closely
spaced in both vertical position and slope. Let $\delta_1 = t/4$ and $\delta_2 = t/(4w)$ (these
again depend implicitly on $j$ and $n$). The parallelepiped $R(j, k, \ell_1, \ell_2)$ will be
centered at $c = ((k + 1/2)w, \ell_1\delta_1)$ and its midline will have slope $s = \ell_2\delta_2$. Here
$0 \le k < w^{-1}$, $\ell_1$ runs through the set $\{0, \ldots, \delta_1^{-1} - 1\}$ and $\ell_2$ runs through the set
$\{-S\delta_2^{-1}, \ldots, S\delta_2^{-1}\}$.

We gather all such regions at level (scale) $j$ in $\mathcal{R}(j) = \{R(j, k, \ell_1, \ell_2) : k,
\ell_1, \ell_2\}$.
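The strip parametrization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dyadic logarithm is taken as the ceiling of log2(n), thickness is measured vertically (so that a strip with vertical sides has area w times t), and the function names are hypothetical.

```python
import math


def strip_geometry(n, j):
    """Return (w, t, delta1, delta2) for anisotropy class j."""
    J = math.ceil(math.log2(n))        # assumption: dyadic logarithm rounded up
    w = 2.0 ** (-j)                    # width
    t = 2.0 ** (-(J - j) + 1)          # thickness, so that w * t = 2^(1 - J)
    return w, t, t / 4.0, t / (4.0 * w)


def strip_contains(point, n, j, k, l1, l2):
    """Test whether a point (x, y) lies in the strip R(j, k, l1, l2)."""
    w, t, d1, d2 = strip_geometry(n, j)
    x, y = point
    if not (k * w <= x < (k + 1) * w):          # horizontal extent of the strip
        return False
    cx, cy = (k + 0.5) * w, l1 * d1             # center c
    midline = cy + l2 * d2 * (x - cx)           # midline with slope s = l2 * delta2
    return abs(y - midline) <= t / 2.0          # vertical thickness t (assumed)


w, t, d1, d2 = strip_geometry(1024, 3)
area = w * t                                    # equals 2 / n when n is a power of two
```

For n = 1024 and j = 3 this gives w = 1/8 and t = 1/64, so each strip has area 2/1024, in line with the strip area 2/n quoted in the abstract.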

To organize the regions, we define a directed graph $\mathcal{G}(j) = (\mathcal{V}(j), \mathcal{E}(j))$,
with vertices $\mathcal{V}(j)$ and edges $\mathcal{E}(j)$. The vertices are simply the regions in
$\mathcal{R}(j)$: $\mathcal{V}(j) = \mathcal{R}(j)$. The edges connect regions to their good continuations,
namely, regions that are horizontally adjacent, and that have altitudes and slopes
that are nearly the same (less than $\delta_1$ and $\delta_2$ apart, respectively). Formally, we have
the directed edges in $\mathcal{E}(j)$,

$$(k, \ell_1, \ell_2) \to (k + 1, \ell_1 + \ell_2 + u, \ell_2 + v),$$

where $|u| \le 4$, $|v| \le 4$.
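The edge rule can be read as follows: since $\delta_2 w = \delta_1$, extending the midline of a strip one width to the right raises its altitude index by $\ell_2$, and the successors are the strips near that extrapolated position. A minimal Python sketch of the successor enumeration follows. The function name, the index-range arguments and the `reach` parameter (set to 4 and treated as inclusive) are assumptions for illustration.

```python
def good_continuations(k, l1, l2, n_k, n_l1, max_l2, reach=4):
    """List the successors (k+1, l1 + l2 + u, l2 + v) with |u|, |v| <= reach.

    n_k is the number of horizontal positions, n_l1 the number of altitude
    indices and max_l2 the largest slope index in absolute value; successors
    that fall outside these ranges are dropped.
    """
    if k + 1 >= n_k:
        return []                      # the strip touches the right edge
    successors = []
    for u in range(-reach, reach + 1):
        for v in range(-reach, reach + 1):
            altitude, slope = l1 + l2 + u, l2 + v
            if 0 <= altitude < n_l1 and -max_l2 <= slope <= max_l2:
                successors.append((k + 1, altitude, slope))
    return successors


succ = good_continuations(0, 10, 2, 8, 256, 64)
```

Away from the boundary every vertex has (2 * reach + 1)^2 successors, so the out-degree is bounded independently of n.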
Figures 2 and 3 illustrate good continuation and bad continuation, respectively.
This graphical structure, while very simple, has a perhaps surprising property:
it allows us to efficiently cover the graphs of general smooth functions that exhibit
any of a range of smoothnesses. This claim is summarized in three lemmas. The
first lemma associates to each Hölder class a specific anisotropy graph.




FIG. 2. Examples of good continuations; the midlines have about the same slope and there is a 
substantial part of a side in common. 


LEMMA 2.1. For each fixed $(\alpha, \beta)$ combination with $1 < \alpha \le 2$, we have
that for all sufficiently large $n$ there is $j^* = j^*(\alpha, \beta; n)$ so that $w = w(j^*)$ and
$t = t(j^*)$ obey

$$2\beta w^\alpha \le t \le 16\beta w^\alpha. \tag{2.1}$$
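As a rough check of the lemma, the following Python sketch scans the scales j = 0, ..., J and returns the first one whose width and thickness satisfy (2.1). The function name and the convention that J is the ceiling of log2(n) are assumptions; the search itself is just a direct evaluation of the inequality.

```python
import math


def find_anisotropy_class(n, alpha, beta):
    """Return the smallest j with 2*beta*w^alpha <= t <= 16*beta*w^alpha, or None."""
    J = math.ceil(math.log2(n))        # assumption: dyadic logarithm rounded up
    for j in range(J + 1):
        w = 2.0 ** (-j)
        t = 2.0 ** (-(J - j) + 1)
        if 2 * beta * w ** alpha <= t <= 16 * beta * w ** alpha:
            return j
    return None                        # n is not yet "sufficiently large"


j_star = find_anisotropy_class(2 ** 20, 2, 1.0)
```

The ratio $t / w^\alpha$ grows by a factor $2^{1+\alpha} \le 8$ from one scale to the next, while the admissible window in (2.1) spans a factor of 8, so the scan cannot jump over the window once it is reachable.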


The next result shows that regions in the anisotropy class $\mathcal{R}(j^*(\alpha, \beta))$ are well
adapted to cover fragments of the graphs of the associated Hölder$(\alpha, \beta)$ class.

LEMMA 2.2. Let $j = j^*(\alpha, \beta)$ and suppose $f$ is a Hölder$(\alpha, \beta)$ function with
a domain that contains $I_k = [kw, (k + 1)w)$. Set $x_k = (k + 1/2)w$, let $\ell_{1,k}\delta_1$ be the
closest multiple of $\delta_1$ to $f(x_k)$ and let $\ell_{2,k}\delta_2$ be the closest multiple of $\delta_2$ to $f'(x_k)$.
We say that the region $R(j, k, \ell_{1,k}, \ell_{2,k})$ is associated to $f$ on $I_k$. This strip covers
the graph of $f$ over their common domain:

$$\mathrm{graph}(f|_{I_k}) \subset R(j, k, \ell_{1,k}, \ell_{2,k}). \tag{2.2}$$

(See Figure 4.)
The final lemma in the sequence shows that every function in the Hölder$(\alpha, \beta)$
class corresponds to a covering sequence of regions that makes a connected path
in $\mathcal{G}(j)$.


LEMMA 2.3. Let $j = j^*(\alpha, \beta)$ and suppose $f$ is a Hölder$(\alpha, \beta)$ function on
$[0, 1]$. For each $k = 0, \ldots, w^{-1} - 1$, consider the region $R_k = R(j, k, \ell_{1,k}, \ell_{2,k})$
associated to $f$ by the procedure mentioned in Lemma 2.2. The sequence of strips
$T_j(f) = \{R_k : 0 \le k < w^{-1}\}$ consists of spatially adjacent regions, making a kind
of tube. When viewed as vertices of $\mathcal{G}(j)$, the $(R_k)$ are neighbors in $\mathcal{G}(j)$, that is,
$R_k$ and $R_{k+1}$ can be connected using edges in $\mathcal{E}(j)$. Therefore, $T_j(f)$ corresponds
to a path in $\mathcal{G}(j)$.

FIG. 3. Examples of bad continuations; either the midlines have very different slopes or the sides
are effectively disjoint.


FIG. 4. Graph of $f$ covered by its $k$th associated region $R_k$ at scale $j$; $f_k$ is the tangent to graph($f$)
at $x_k$ and $g_k$ is the midline of $R_k$.


These lemmas together show that, while the graphical structure itself is based 
on very simple rules, it is able to associate paths in the graph with custom-fitting 
tubes that cover the graphs of very different kinds of smooth functions. The proofs 
of these lemmas are given in the Appendix. Figure 5 illustrates the idea. 


3. The multiscale significant-runs algorithm. We now describe the com- 
plete algorithm for analysis of point-cloud data (X,) looking for suspected curvi- 
linear structure. It depends on a counting threshold N* and a length threshold L7, 
both to be defined later. The algorithm has several steps: 


1. Counting membership in anisotropic strips. For every region R, in every 
anisotropy class, we count the number of data that fall into that region, 


N(R) -&(i:X, € R}. 


2. Identifying significant counts. We define a significance indicator, which is 
nonzero when the counts exceed a threshold, 


S(R) = LIN(R)>N*}. 


334 E. ARIAS-CASTRO, D. L. DONOHO AND X. HUO 


p 


NX 


EN 
A 


FIG. 5. Graph of f covered by its associated tube T, (f) at scale j. 


The significance indicator may be viewed as a label on the regions R, producing 
a sequence of a labeled graphs 


EG) = (VG), 80), o )), 


where o (j) = (s(R)) gives the labels on R € R(j). We call this the jth signif- 
icance graph. 


3. Computing longest paths. In each significance graph, we employ a depth-first
search algorithm to explore all significant paths

$$\pi = (R_1, R_2, \ldots, R_m),$$

that is, sequences of vertices that are:

(a) all significant, $s(R_k) = 1$;
(b) all connected, $(R_k, R_{k+1}) \in \mathcal{E}(j)$.

We record the maximum path length in each significance graph:

$$L_{n,j}^{\max} = \max\{\mathrm{length}(\pi) : \pi \text{ is a significant path in } \Sigma\mathcal{G}(j)\},$$

$$L_n^{\max} = \max_j L_{n,j}^{\max}.$$


4. Decision. We compare $L_n^{\max}$ with a length threshold:

If $L_n^{\max} \le L_n^*$, accept $H_0$; if $L_n^{\max} > L_n^*$, reject $H_0$.

This defines the test, except for the specification of the thresholds $L_n^*$ and $N^*$.
Asymptotic formulas for these thresholds will be given in Section 4.
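The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the representation of regions as membership tests, and the dictionary `edges` are hypothetical, and the edge relation is assumed to be acyclic (each strip connects only to strips further along the $x$-axis), so that a memoized depth-first search returns the longest path. Path length is counted as the number of regions on the path.

```python
from functools import lru_cache

def count_points(regions, points):
    """Step 1: N(R) = #{i : X_i in R}; `regions` maps a label to a membership test."""
    return {label: sum(1 for p in points if inside(p))
            for label, inside in regions.items()}

def longest_significant_path(counts, edges, n_star):
    """Steps 2-3: number of regions on the longest path of significant regions.

    `edges` maps a region label to its successors and is assumed acyclic.
    """
    significant = {r for r, c in counts.items() if c > n_star}  # s(R) = 1

    @lru_cache(maxsize=None)
    def longest_from(r):
        # Depth-first search with memoization over significant successors.
        tails = [longest_from(q) for q in edges.get(r, ()) if q in significant]
        return 1 + max(tails, default=0)

    return max((longest_from(r) for r in significant), default=0)

def rejects_null(counts, edges, n_star, l_star):
    """Step 4: reject H0 when the longest significant run exceeds l_star."""
    return longest_significant_path(counts, edges, n_star) > l_star

# Toy chain of four strips A -> B -> C -> D with made-up counts.
counts = {"A": 3, "B": 3, "C": 0, "D": 3}
edges = {"A": ["B"], "B": ["C"], "C": ["D"]}
print(longest_significant_path(counts, edges, n_star=2))
```

In the toy chain the strip C is not significant, so the run A, B is the longest one. In the actual construction the vertices and edges would come from the multiscale graphs $\mathcal{G}(j)$, one call per scale, followed by a maximum over $j$.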

A worked example of the multiscale significant-runs algorithm is illustrated in
Figure 6. For this example the synthetic point cloud is mostly uniformly distributed,
with fraction $\varepsilon = 1/20$ of points lying on a fixed smooth curve. The significance
threshold was $N^* = 8$. For this choice of threshold, we have (under $H_0$)
$P\{N(R) > N^*\} \approx 0.00024$.
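The quoted tail probability can be checked numerically. The sketch below assumes, as in Section 4, that each region has area $2/n$, so that under $H_0$ the count $N(R)$ is Bin$(n, 2/n)$ and approximately Poisson(2); the helper names are ours.

```python
import math

def poisson_tail(lam, k):
    """P{Poisson(lam) > k}, computed as one minus the CDF at k."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf

def binomial_tail(n, p, k):
    """P{Bin(n, p) > k}, computed as one minus the CDF at k."""
    cdf = sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1.0 - cdf

# N* = 8 and n = 1024 as in the worked example; mean count 2 under H0.
print(poisson_tail(2.0, 8))
print(binomial_tail(1024, 2 / 1024, 8))
```

Both tails are of order $2.4 \times 10^{-4}$, in line with the value reported in the text; the exact binomial tail is slightly smaller than its Poisson approximation.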

The longest run in this example has length 5, which in this case exceeds the run-length
threshold $L_n^*$ and leads to rejection of the null hypothesis. For this small-sample
setting, the threshold $L_n^* = 3$ was obtained by simulation rather than asymptotic
theory. (Under the null hypothesis, we conducted 1200 experiments. In
each experiment, a point cloud was generated and $L_n^{\max}$ was computed. The frequencies
of $L_n^{\max} = 1, 2, 3, 4$ and 5 were 302, 873, 24, 0 and 1, respectively. Based
on these results, $L_n^*$ can be set at either 2 or 3, giving a test with empirical level
$P\{L_n^{\max} > 2\} = 25/1200$ or $P\{L_n^{\max} > 3\} = 1/1200$, resp.)
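The two empirical levels follow directly from the reported frequency table; a short check with exact fractions (variable names are ours):

```python
from fractions import Fraction

# Frequencies of the longest-run length over the 1200 null experiments.
freq = {1: 302, 2: 873, 3: 24, 4: 0, 5: 1}
total = sum(freq.values())

def empirical_level(threshold):
    """Fraction of null experiments whose longest run exceeds `threshold`."""
    exceed = sum(count for length, count in freq.items() if length > threshold)
    return Fraction(exceed, total)

print(total, empirical_level(2), empirical_level(3))
```

The frequencies sum to 1200, and the runs longer than 2 (respectively 3) account for 25 (respectively 1) of the experiments.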

Asymptotic theory gives $L_n^* \approx 3.74$, which leads to the same decision.

For comparison, Figure 7 gives a simulated example in the null case $\varepsilon = 0$;
the longest run has length 2 in this case. This simulation was typical; in the null
example, rarely does the longest run exceed 2. Several properties of the algorithm
are immediate:


• Complexity of strip counts. The algorithm calculates all the $N(R)$ for all the
anisotropic strips. This takes $O(n^2 \log(n))$ flops, where $n$ is the number of points
sampled. Indeed, since each data point can belong to order $O(n \log(n))$ strips,
by simply calculating which $R \ni X_i$ and incrementing a counter for those $R$, we
get all $N(R)$.

• Complexity of longest path. The algorithm calculates the longest path in each
significance graph. This takes work comparable to $O(n \log(n))$, based on depth-first
search [2].

• Storage requirement. The algorithm stores all of the $N(R)$ counts and all the
significance coefficients. This requires $O(n \log(n))$ storage.


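To state our main result, we amend our notion of alternative hypothesis. For
$\alpha > 1$, let Hölder$(\alpha, \beta, S)$ denote the collection of functions $f \in$ Hölder$(\alpha, \beta)$ with
$\|f'\|_\infty \le S$. Define

$$H_1(\alpha, \beta, S, \tau): \quad X_i \sim (1 - \varepsilon_n)\,\mathrm{Uniform}(0, 1)^2 + \varepsilon_n\,\mathrm{Uniform}(\mathrm{graph}(f)),$$

$$f \in \text{Hölder}(\alpha, \beta, S), \qquad \varepsilon_n \ge \tau \cdot n^{-\alpha/(1+\alpha)}.$$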




FIG. 6. A uniform random scatter contaminated by a fraction $\varepsilon = 1/20$ of points on a curve, together with the
identified significant run (consisting of five strips); $n = 1024$.







FIG. 7. A uniform random scatter, together with the identified longest run of significant strips. At
length 2 it is not a significant run; $n = 1024$.


THEOREM 3.1. There is a single choice of thresholds $N^*$ and $(L_n^*)$, so that
for every $\alpha \in (1, 2]$ and $\beta > 0$, there is $\tau_+(\alpha, \beta, S)$ with, for each $\tau > \tau_+$,

$$P\{\text{test rejects } H_0 \mid H_1(\alpha, \beta, S, \tau)\} \to 1 \quad \text{as } n \to \infty;$$

at the same time

$$P\{\text{test rejects } H_0 \mid H_0\} \to 0 \quad \text{as } n \to \infty.$$

In words, under $H_0$ the longest significant path is overwhelmingly unlikely to be
substantially longer than $L_n^*$ for large $n$, while under each indicated $H_1$ the longest
significant path is overwhelmingly likely to be substantially longer than $L_n^*$. The
threshold $\tau_+$ is within a constant factor $\tau_+/\tau_-$ of the optimal detection threshold;
this shows that, up to constants, we can adaptively test for the existence of
fragments of $C^\alpha$ graphs.


4. Asymptotics of thresholds. For practical finite-sample application of the 
test just described, it is of course possible to calibrate thresholds by conducting 
simulation experiments. For the proof of our main result, we specify thresholds in 
closed form. These thresholds are very conservative and our closed-form analysis 
yields vastly overstated estimates of error probabilities, which are nevertheless 
good enough for our proofs. 




4.1. Specification of $N^*$. Fix a probability $p_0 < 1$. Define a counting threshold
$N^*(\varepsilon, \lambda)$ with the property that

$$P\{\mathrm{Poisson}(\lambda) > N^*\} \le \varepsilon,$$

where Poisson$(\lambda)$ denotes a Poisson random variable with parameter $\lambda$. Let
Bin$(n, p)$ denote a binomial random variable with parameters $n$ and $p$. By Poisson
approximation to the binomial, we have

$$P\{\mathrm{Bin}(n, \lambda/n) > N^*(\varepsilon, \lambda)\} \le 2\varepsilon, \qquad n > n_0.$$

Set then $N^* = N^*(p_0/162, 2)$. Our definition of $N^*$ gives us the key property

$$P\{s(R) = 1 \mid H_0\} = P\{\mathrm{Bin}(n, 2/n) > N^*\} \le p_0/81, \qquad n > n_0. \tag{4.1}$$
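For a concrete feel for this definition, the counting threshold can be found by a direct search over integers. The sketch below takes $N^*(\varepsilon, \lambda)$ to be the smallest integer satisfying the defining inequality (the text only requires the inequality, so taking the smallest such integer is our assumption), and the value $p_0 = 0.5$ is purely illustrative.

```python
import math

def poisson_tail(lam, k):
    """P{Poisson(lam) > k}, computed as one minus the CDF at k."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))
    return 1.0 - cdf

def counting_threshold(eps, lam):
    """Smallest integer N with P{Poisson(lam) > N} <= eps."""
    n_star = 0
    while poisson_tail(lam, n_star) > eps:
        n_star += 1
    return n_star

p0 = 0.5  # hypothetical choice of the probability p0
print(counting_threshold(p0 / 162, 2.0))
```

With the tolerance $0.00024$ in place of $p_0/162$, the same search returns 8, the threshold used in the worked example of Section 3.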


4.2. Specification of $L_n^*$. We use a convenient, but nonstandard, notation borrowed
from Arratia and Waterman [6]. For $0 < p < 1$, $\log_p$ denotes, in this unconventional
notation, the logarithm with base $1/p$. At the same time, we maintain the
original convention that if $b > 1$, $\log_b$ means log to the base $b$. By this convention,
$\log_2$ is the traditional logarithm base two and $\log_{1/2}$ is actually the same quantity.

Define the length threshold

$$L_n^* = 3\log_{p_0}(n), \tag{4.2}$$

where $p_0 \in (0, 1)$ is the same as in the specification of $N^*$. The underlying rationale
for this choice is the Erdős-Rényi law (see [6]), which says:

In a sequence of $m$ i.i.d. Bernoulli random variables with probability $p$ of heads, the
length of the longest run of pure heads $\sim \log_p(m)(1 + o_P(1))$.

In effect, our definition makes $L_n^*$ very substantially longer than the length of the
longest run of pure heads in a linear sequence of $O(n^{1/(1+\alpha)})$ coin tosses of a $p_0$
coin.
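Since the base-$1/p$ convention is easy to misread, a small sketch may help. The function names are ours, and the numerical values of $p_0$ and $n$ are illustrative only.

```python
import math

def log_inv_base(p, x):
    """log_p(x) in the convention above: logarithm to base 1/p, for 0 < p < 1."""
    return math.log(x) / math.log(1.0 / p)

def length_threshold(p0, n):
    """L*_n = 3 log_{p0}(n), as in (4.2)."""
    return 3.0 * log_inv_base(p0, n)

def longest_run(bits):
    """Length of the longest run of consecutive truthy entries (pure heads)."""
    best = current = 0
    for b in bits:
        current = current + 1 if b else 0
        best = max(best, current)
    return best

# With a fair coin and m = 1024 tosses the Erdős-Rényi prediction is log_{1/2}(1024).
print(log_inv_base(0.5, 1024), length_threshold(0.5, 1024))
print(longest_run([1, 1, 0, 1, 1, 1, 0]))
```

The factor 3 in (4.2) therefore places the threshold at three times the typical longest run in $n$ tosses of a $p_0$ coin.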


4.3. Specification of $\tau_+$. Associated to the parameters $N^*$ and $L_n^*$ will be the
threshold $\tau_+$ at which $H_1$ becomes detectable. To define that, set $p_1$ sufficiently
close to 1 so that, for all $\alpha \in (1, 2]$ and for some $n_1 > 0$,

$$\log_{p_1}(n^{1/(1+\alpha)}) > 2 \cdot L_n^*, \qquad n > n_1 \tag{4.3}$$
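($p_1 = p_0^{1/c}$ works for any constant $c > 18$, since then $\log_{p_1}(n^{1/(1+\alpha)}) = \frac{c}{1+\alpha}\log_{p_0}(n) > 6\log_{p_0}(n)$ for all $\alpha \le 2$). Implicitly, this choice again refers to the Erdős-Rényi law
and, in some way, guarantees that under $H_1$ there will be very long runs.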

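Define an intensity threshold $\lambda^*(\varepsilon)$ with the property that

$$P\{\mathrm{Poisson}(\lambda^*(\varepsilon)) \le N^*\} \le \varepsilon.$$

Set $\lambda^* = \lambda^*((1 - p_1)/2)$ and set

$$\tau_+(\alpha, \beta, S) = 2\lambda^* \cdot \beta^{1/(1+\alpha)} \cdot \sqrt{1 + S^2}.$$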




We will show that for

$$\varepsilon_n \ge \tau_+ n^{-\alpha/(1+\alpha)} \tag{4.4}$$

the hypothesis $H_1(\alpha, \beta)$ becomes detectable by our proposal. The central point
will be the property

$$n \cdot \varepsilon_n \cdot w(j^*(\alpha, \beta, n))/\sqrt{1 + S^2} \ge \lambda^*. \tag{4.5}$$

To prove this, we inspect the proof of Lemma 2.1 in the Appendix and note that,
by definition of $\tau_+(\alpha, \beta, S)$ above and using notation from the Appendix,

$$n \cdot \varepsilon_n \cdot w(j^*(\alpha, \beta, n))/\sqrt{1 + S^2} \ge n^{1/(1+\alpha)} \cdot \tau_+ \cdot w(j^+(\alpha, \beta, n))/(2\sqrt{1 + S^2}) \ge \lambda^*.$$

Incidentally, we do not claim that the algorithm fails for $\tau < \tau_+$, only that we can
prove it succeeds for $\tau > \tau_+$.
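Property (4.5) can be checked numerically for particular parameter values. The sketch below is an illustration only: it assumes $n = 2^J$, $w(j) = 2^{-j}$ and $w(j)\,t(j) = 2/n$ as in the Appendix, so that $j^+ = \log_2(\beta n)/(1+\alpha)$; it finds $\lambda^*(\varepsilon)$ by bisection, which is our choice of method; and the values $N^* = 8$, $p_1 = 0.9$, $S = 1$ are hypothetical.

```python
import math

def poisson_cdf(lam, k):
    """P{Poisson(lam) <= k}."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def intensity_threshold(eps, n_star, tol=1e-9):
    """An intensity lam with P{Poisson(lam) <= n_star} <= eps, found by bisection."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(hi, n_star) > eps:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if poisson_cdf(mid, n_star) > eps:
            lo = mid
        else:
            hi = mid
    return hi

def tau_plus(lam_star, alpha, beta, S):
    """tau_+ = 2 lam* beta^{1/(1+alpha)} sqrt(1 + S^2)."""
    return 2.0 * lam_star * beta ** (1.0 / (1.0 + alpha)) * math.sqrt(1.0 + S * S)

def width_at_j_star(alpha, beta, n):
    """w(j*) = 2^{-j*} with j* = ceil(j+), where 2 beta w(j+)^alpha = t(j+)."""
    j_plus = math.log2(beta * n) / (1.0 + alpha)
    return 2.0 ** (-math.ceil(j_plus))

def lhs_4_5(n, eps_n, w, S):
    """Left-hand side of (4.5)."""
    return n * eps_n * w / math.sqrt(1.0 + S * S)

# Hypothetical values: N* = 8, p1 = 0.9, alpha = 2, beta = 1, S = 1, n = 2**16.
lam_star = intensity_threshold((1 - 0.9) / 2, 8)
n, alpha, beta, S = 2 ** 16, 2.0, 1.0, 1.0
eps_n = tau_plus(lam_star, alpha, beta, S) * n ** (-alpha / (1 + alpha))
print(lhs_4_5(n, eps_n, width_at_j_star(alpha, beta, n), S) >= lam_star)
```

At the boundary $\varepsilon_n = \tau_+ n^{-\alpha/(1+\alpha)}$ the left-hand side equals $2\lambda^* \cdot 2^{j^+ - j^*}$, which lies between $\lambda^*$ and $2\lambda^*$ because $0 \le j^* - j^+ < 1$.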


5. Behavior under $H_1$. Let $f$ be the function in Hölder$(\alpha, \beta)$ that generates
the curve that carries the fraction $\varepsilon_n$ of data. From Lemmas 2.1-2.3, we let $j = j^*$
and consider the tube $T_j(f)$. For each region $R$ in this tube,

$$N(R) \sim \mathrm{Bin}(n, (1 - \varepsilon_n)\,\mathrm{area}(R) + \varepsilon_n \gamma(f, R)),$$

where area$(R) = 2/n$ and $\gamma$ denotes the relative arc length in the graph of $f$,
obeying

$$\gamma(f, R) = \frac{\mathrm{length}(\mathrm{graph}(f|_I))}{\mathrm{length}(\mathrm{graph}(f))} \ge \frac{w}{\sqrt{1 + S^2}};$$

here $I$ denotes the projection of $R$ on the $x$-axis.
By Poisson approximation to the binomial,

$$N(R) \approx \mathrm{Poisson}(\omega),$$

where, using (4.5),

$$\omega \ge 1 + n\varepsilon_n w/\sqrt{1 + S^2} \ge \lambda^*.$$

Hence, for every $R$ in this tube, we have, for all sufficiently large $n > n_3$,

$$P\{N(R) > N^*\} \ge p_1. \tag{5.1}$$

Label the sequence of strips $R$ in this tube $R_0, \ldots, R_{w^{-1}-1}$. We want to know
the probable length of the longest run of the form

$$N(R_k) > N^*, \ldots, N(R_{k+\ell}) > N^*.$$




Then, if $L_n$ is the length of the longest run in this special sequence, it follows that
the longest run statistic we are computing over the entire graph can only be larger:

$$L_n^{\max} \ge L_n.$$

To show that the test rejects $H_0$, we will show that

$$P\{L_n > L_n^*\} \to 1, \qquad n \to \infty. \tag{5.2}$$

If we define

$$Z_i = 1_{\{N(R_i) > N^*\}},$$

we note that each $Z_i$ is Bernoulli with probability $p_{i,n}$, while

$$p_{i,n} \ge p_1.$$

We let $m = w^{-1} \ge \mathrm{Const} \cdot n^{1/(1+\alpha)}$ and $p = p_1$, and we get, by the Erdős-Rényi
law,

$$L_n \ge \log_{p_1}(n^{1/(1+\alpha)})(1 + o_P(1)).$$

However, by hypothesis we have chosen $p_1$ in such a way that

$$\log_{p_1}(n^{1/(1+\alpha)}) > 2 \cdot L_n^*.$$

Hence (5.2) follows.

6. Behavior under $H_0$. We need to show that, with overwhelming probability
under $H_0$, there will be no runs in the graph that exceed $L_n^*$. We start by arguing
that

$$P\{\text{a significant path of length } L \text{ starts at given } R \mid H_0\} \le p_0^L.$$

Indeed, by choice of $N^*$, for each $R$,

$$P\{s(R) = 1 \mid H_0\} = P\{\mathrm{Bin}(n, 2/n) > N^*\} \le p_0/81.$$

Now each region $R$ has 81 neighbors in $\mathcal{G}(j)$, and so by Boole's inequality,

$$\begin{aligned}
P\{s(R') = 1 \text{ for at least one neighbor of } R \mid H_0\}
&\le \sum_{R' \in \mathrm{neighbors}(R)} P\{s(R') = 1 \mid H_0\} \\
&\le \#\mathrm{neighbors} \cdot (p_0/81) \\
&= 81 \cdot (p_0/81) = p_0.
\end{aligned}$$

Now if we are looking for a significant path of length greater than $L$, we need
that starting at some vertex, it has $s(R) = 1$ and is connected to a vertex with
$s(R') = 1$, et cetera. For a given starting point $R$, the probability of this event is
bounded by $p_0^L$ via negative correlation.




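We now note there are at most $M_j = w^{-1} \cdot \delta_1^{-1} \cdot \delta_2^{-1} \cdot 2S$ starting points for paths
in $\mathcal{G}(j)$. By Boole's inequality,

$$\begin{aligned}
P\{\text{there is a significant path of length } L \text{ occurring in } \mathcal{G}(j) \mid H_0\}
&\le \#(\text{starting points at level } j) \\
&\quad \times P\{\text{significant path of length } L \text{ starting at } R \mid H_0\} \\
&\le M_j\, p_0^L.
\end{aligned}$$

Take $\log_2$:

$$\begin{aligned}
\log_2(M_j) + \log_2(p_0) \cdot L &= \log_2(w^{-1} \cdot \delta_1^{-1} \cdot \delta_2^{-1} \cdot 2S) + \log_2(p_0) \cdot L \\
&= 2J - 2j + 3 + \log_2(S) + \log_2(p_0) \cdot L \\
&\le 2J + C + \log_2(p_0) \cdot L.
\end{aligned}$$

For $L = L_n^*$, the last expression on the right-hand side tends to $-\infty$ as $n$ increases.
Hence

$$P\{\text{there is a run of length } L_n^* \mid H_0\} \to 0, \qquad n \to \infty.$$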
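To see why the bound tends to $-\infty$, note that $\log_2(p_0) \cdot L_n^* = -3\log_2(n)$. The sketch below evaluates the middle expression of the display at $L = L_n^*$, assuming $n = 2^J$ (consistent with $t(j) = 2^{-J+j+1}$ and area $2/n$ in the Appendix); the parameter values are illustrative.

```python
import math

def length_threshold(p0, n):
    """L*_n = 3 log_{p0}(n), with log_{p0} the logarithm to base 1/p0."""
    return 3.0 * math.log(n) / math.log(1.0 / p0)

def log2_null_bound(J, j, S, p0):
    """log2(M_j * p0**L) at L = L*_n, assuming n = 2**J and log2(M_j) = 2J - 2j + 3 + log2(S)."""
    L = length_threshold(p0, 2 ** J)
    return 2 * J - 2 * j + 3 + math.log2(S) + math.log2(p0) * L

for J in (10, 20, 40):
    print(J, log2_null_bound(J, j=3, S=1.0, p0=0.1))
```

Whatever the value of $p_0$, the result equals $-J - 2j + 3 + \log_2(S)$ up to rounding, so the bound on the probability decays like $1/n$ at each scale.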


7. Discussion. We have given only a sampling of results in a specific problem 
of geometric detection; much more could be done. At the same time, our results 
are closely related to many ideas in the literature of computer vision. We briefly 
indicate possible variations and sketch a few such connections. 


7.1. Variations on the filament model. We have only discussed a subset of 
what could pass for filamentary structure in point-cloud data. Other notions of fila- 
mentarity include (a) curves that have less regularity than one derivative, (b) curves 
that have more regularity than two derivatives and (c) curves that are not describ- 
able as simple graphs (x, y = f(x)). Our aim in this paper is to stimulate dis- 
cussion; we believe that all such generalizations will be of interest in appropriate 
applications areas. We make a few brief remarks. 


• Curves that have regularity $\alpha < 1$. If we consider curves $(x, f(x))$, where $f$
is Hölder-$\alpha$ with $\alpha < 1$, we are considering curves without tangents. We thus
discard completely the notion of good continuation based on alignment of tangents.
In designing graph-based detectors, it is only required to use axis-oriented
rectangular regions, so that only position (not slope) matters, and in which the
rectangles are now taller than they are wide; connectivity involves only position
(not orientation). The statistical treatment based on graphs and runs turns out to
be the same; the graphs simply have less structure because proximity does not
involve similarity of slopes.




• Curves that have regularity $\alpha > 2$. It makes sense to ask about smoothness of
higher order, for example, to consider $2 < \alpha < 3$. By [5], there will continue
to be about $n^{1/(1+\alpha)}$ points on some Hölder curve, even for $\alpha$ in this range. To
sensitively detect such higher-order smoothness would require a different set
of regions than the one discussed here (curved ones with parabolic midlines)
and a notion of good continuation based on matching of sides, and matching of
slopes and curvatures of midlines. Preliminary calculations suggest that analogous
"adaptivity" results hold in such a setting. Related discussion can be found
in [3, 5].

• Curves that are not graphs. The Introduction suggested that the approach described
here can be adapted to detect general plane curves, that is, curves that
are not graphs. This adapts ideas from our work on beamlet graphs [4, 16]. We
define a family of directed graphs based on regions modeled on "dyadically
thickened beamlets" with various degrees of thickening. In this directed graph
structure, strips can have all orientations, including vertical and horizontal, so
that the graph constraint is removed. Connectivity between beamlets is based
once more on good continuation principles; in this case, continuation of plane
polygons rather than polygonal graphs. Otherwise the algorithms are identical.
More details can be found in [5], where this structure is utilized to prove a
theoretical result.


Still other variations on our model are possible. The idea that data are uniformly
sampled from a curve of zero width can be varied in several ways:


• Nonuniform sampling along filaments. A referee suggested that rather than from
a uniform density, data might arise instead from a density that is bounded away
from zero and infinity. The same methodology developed here works in that case
without change, except that the analysis under $H_1$ seemingly becomes more
involved.

• Finite thickness. A referee also suggested that rather than from a curve of zero
thickness, the data might arise instead from a tube of finite thickness. The
methodology developed here works without change in such a case, but the
model $H_1$ is different and the statement of results becomes different. Thus, if
the thickness of the tube is finite, but smaller than the width of the regions that
are adapted to the underlying filament, the detectability results are the same as
here. If the width is greater than that of regions adapted to the curve, then the
detection threshold becomes higher and it takes systematically larger numbers
of points near a curve to reject $H_0$.

• Finite resolution. A referee also suggested that the data might be of finite accuracy,
for example, either subject to rounding or to noise. In either case, the
situation is much the same as in the immediately preceding comment. If the
inaccuracy is small compared to the width of the optimally fitted regions, its
impact is negligible. If the inaccuracy is larger than that width, then the detection
threshold becomes higher and rejection of $H_0$ requires systematically larger
numbers of points near a curve.




We leave exploration of such issues for future work. 


7.2. Beyond detection. A referee pointed out the desirability of not just
detecting the presence of filamentary structures, but also estimating the detailed 
location and shape of any filaments that are detected. Of course, conceptually de- 
tection and estimation are quite different tasks; moreover, unless detection is pos- 
sible, estimation is impossible. Empirically, our methodology actually provides 
an estimator when the filament is not just detected, but strongly detected. As the 
number of points sampled from the filament increases well beyond the detection 
threshold, simulations show that the longest run becomes overwhelmingly likely 
to trace out a series of regions that bracket the curve tightly. It seems likely that this 
effect could be proven rigorously. However, it seems rather delicate to formulate 
and study an appropriate notion of asymptotically efficient estimation. 


7.3. Extensions beyond the filament model. We can extend beyond the setting 
of two-dimensional point clouds in at least three ways: going to higher dimensions, 
observing vectors rather than points and observing pixel imagery on a grid rather 
than scattered points. 


7.3.1. Structures in higher dimensions. The analogous detection problem in 
d-dimensional space—finding a curve or surface that contains an unexpectedly 
large number of points—has been considered by the authors in [3, 5]. That work 
provides nonadaptive detectors, that is, detectors that assume knowledge of the 
Hölder class.

The ideas developed in this paper can be directly applied to multiscale detec- 
tion of filamentary structure in d-dimensional point clouds, d > 2. A more ambi- 
tious generalization—detection of codimension-k surfaces in d-dimensional point 
clouds—seems possible, but also messier. For example, for d = 3 and k = 1, we 
are attempting to find a surface in d-dimensional space that contains an inordi- 
nately large number of points. The nodes of each anisotropy graph are planar slabs 
of volume 2/n and the neighborhood structure in the graph is, while conceptually 
analogous to the case considered here, far more complex to write out. The compa- 
rable adaptive detection theorems hold in that setting, although we omit details. 


7.3.2. Vector fields. Suppose that instead of data on points $X_i$, we have data on
tangent vectors, that is, on pairs $(X_i, \theta_i)$ that name both a position and a direction.
Experiments in perceptual psychophysics (e.g., [19, 28]) suggest that this is a much
more potent stimulus to "curve finding" than simply the display of random dots.
Biological evidence about early vision suggests that individual receptive fields fire
when both location and orientation offer matches.

As a null hypothesis, we suppose that the $X_i$'s are i.i.d. uniform$[0, 1]^2$, while
the $\theta_i$'s are i.i.d. uniform$[-\pi/2, \pi/2]$. As an alternative hypothesis, we could posit




that a small fraction of the $X_i$ lie on a curve, $X_i = (x_i, f(x_i))$, and the $\theta_i$ specify
angles parallel to the line with slope $f'(x_i)$.

Our paper [5] showed that such tangent vector data are substantially more powerful
for identifying filamentary structure than the point data discussed so far in
this paper. Namely, such data give us the ability to detect filaments that contain
much smaller fractions of data points. In fact, a reliable test for a Hölder(2, 1) filament
can be based on agreement with more than $\tau' n^{1/4}$ tangent vectors, rather than
$\tau n^{1/3}$ points.

The multiscale data structures used in this paper can be applied in the tangent
vector setting, where we match $(X_i, \theta_i)$ to a region based not only on $X_i \in R$,
but also on $\theta_i$ matching the slope of the midline of $R$. The resultant algorithm
can take advantage of this more stringent matching structure to speed up the
counting process, because each tangent vector will lie in only one region in a
given anisotropy class, resulting in $O(\log(n))$ flops per data point rather than
$O(n \log(n))$. Searching for significant paths will be significantly faster as well,
since there can only be $n \log(n)$ starting places for a path. Hence the whole algorithm
can run in $O(n \log(n))$ flops. An analysis that parallels the one given
here shows that a multiscale multianisotropic significance runs algorithm can provide
a detection threshold that is optimal to within a constant factor. Huo and
co-workers [23] gave more details from the computational aspect of this problem.


7.3.3. Pixel imagery. In another direction, we might consider data types used
to model digital imagery, for example, arrays $(y(i_1, i_2) : 0 \le i_1, i_2 < n)$, where

$$y(i_1, i_2) = \xi_f(i_1, i_2) + \sigma z(i_1, i_2), \qquad 0 \le i_1, i_2 < n,$$

$(z(i_1, i_2))$ is a Gaussian white noise, $\sigma$ is the noise level and $\xi_f$ is a pixel array
with nonzero values only on pixels that intersect graph$(f)$. Despite appearances,
this problem is closely related to the present problem and analogous detection
theorems are true.

In effect, we form a family of anisotropic multiscale strips and sum the pix- 
els that intersect those strips, producing detector statistics X(R) that can be 
significance-tested in a way that parallels the counts N(R) considered in this pa- 
per, only with Gaussian rather than Poisson threshold analysis. The underlying 
data structures and arguments can be understood as, roughly speaking, a mixture 
of the ideas of this paper and those of another paper by these authors [4]. In partic- 
ular, the strips considered here were called axoids in that work. More details can 
be found in [21, 22]. In the digital array setting, there is a fast beamlet transform 
to rapidly compute all the required X (R) detector statistics. 


7.4. On the uniform background assumption. A referee remarked that many 
practical imaging problems do not involve small departures from a uniform background.
This is no doubt true. However, there is an ever-expanding array of imaging




problems, and some look for changes between one image and another, for exam- 
ple, in studying arterial blood flow or in change detection in scene surveillance. In 
the case of no change, the background is quite literally uniform. Also, many prob- 
lems with nonuniform background are transformable to uniform background; thus 
edge detectors and object detectors are typically operated at so-called constant 
false alarm rate. Then they give, under the null hypothesis of no object present, 
roughly a constant number of events per unit area, and so the constant false alarm 
rate transformation forces a uniform background under $H_0$. Finally, the intellectually
important issues explored here seem clearest in the uniform case and the data
structures developed here are known to be useful in more complex settings [4]. 

The main point, however, is well taken—detection of objects against nonuni- 
form clutter rather than uniform scatter remains a challenging area for further 
work. 


7.5. Relationships to other work. 


7.5.1. Parametric detection. In this paper we considered detection of points
on nonparametric curves. In certain cases, one is interested in points along lines [8]
or on parabolas [1]. For an attractive nonmultiscale approach to such detection
problems, see [11-13] and [33], Section 6.3.

In the authors' paper [4], which was just mentioned, it was shown how to develop
multiscale geometric detectors for line segments in digital imagery, for parametric
forms such as circles, rectangles and ellipses. Those ideas could be adapted
to the present setting to find situations where data have an elevated density over a
blob or along some line. In the end, the underlying computations involve dyadic
multiscale rectangles and strips, and the ideas are closely related to those in this
paper.


7.5.2. Multiscale geometric analysis. The tools described here are closely related
to a variety of tools in multiscale methods; see [10, 15, 16, 18, 24, 27, 32] for
discussions of related tools applied in image analysis and in mathematical analysis. 
We differ here in our use of a multiscale multianisotropy collection of analyzing 
regions that is organized and exploited in a specific way and for a specific purpose. 


7.5.3. Object grouping and neural architecture. In the literature of computer 
vision, there is extensive discussion of object grouping in perception [7]. The 
problem considered here—recognizing a curve against a background of random 
points—fits in this tradition. There are even experiments in psychophysics that test 
the ability of the human visual system to accomplish similar tasks [19, 25]. In 
effect, what we have discussed here—(near-) optimal detection—corresponds to 
what psychophysicists call "the ideal observer" [26]. In this connection, we have 
exhibited a simple multiscale architecture that can provide a near-optimal detec- 
tor for a very wide range of stimuli—curves of any of a wide range of degrees of 
smoothness. 




In the Gestalt theory of perception, there arises the concept of good continuation 
[19, 35]; experiments show that the visual system will respond better to curvilinear 
stimuli that follow a good continuation of an initial pattern [30]. Here we have op- 
erationalized this principle by a regular connectivity pattern in a specific graphical 
structure. We have shown that with this implementation, we get a near-optimal de- 
tector, thus validating the significance of such a good continuation principle. Note 
well that the connectivity pattern is invariant; that is, it applies the same way at all 
nodes of the graph. 

This architectural simplicity is striking when compared with the vast specula- 
tive literature that proposes "neural architectures" for visual perception. What we 
have shown is that by starting from a large collection of elements that are sensitive 
to (i.e., accumulate counts in) receptive fields at a variety of lengths, widths, ori- 
entations and locations, and then connecting such elements to other elements by a 
simple invariant rule, one very sensitively recognizes the existence of curvilinear 
stimuli simply by the existence of long connected paths. 

Perhaps this bears comparison with biological evidence. Functional magnetic 
resonance imaging studies [9] suggest that there are centers in primate brains that 
seem responsible for integrating local information into recognition of long curvi- 
linear structures [29]. It would be interesting to know whether such integration has 
any resemblance to the simple multiscale connection mechanism employed here. 


APPENDIX 


PROOF OF LEMMA 2.1. Extend notation so that $w(j) = 2^{-j}$ can have both
real and integer arguments, and $t(j) = 2^{-J+j+1}$ as well. Let $j^+ = j^+(\alpha, \beta, n)$
satisfy

$$2\beta\, w(j^+)^\alpha = t(j^+),$$

that is,

$$2\beta\, 2^{-\alpha j^+} = 2^{-J+j^++1}.$$

Let $j^* = \lceil j^+ \rceil$ be the next larger integer, so that $w(j^+)/2 < w(j^*) \le w(j^+)$ and
$t(j^+) \le t(j^*) < 2t(j^+)$. Then because $1 < \alpha \le 2$, $2^\alpha \le 4$ and so

$$2\beta\, w(j^*)^\alpha \le 2\beta\, w(j^+)^\alpha = t(j^+) \le t(j^*) < 2 \cdot t(j^+) = 4\beta\, w(j^+)^\alpha \le 16\beta\, w(j^*)^\alpha.$$

Now let $w = w(j^*)$ and $t = t(j^*)$, and substitute in the last display, getting (2.1).

□
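The chain of inequalities in this proof is easy to spot-check numerically. The sketch below assumes the definitions $w(j) = 2^{-j}$ and $t(j) = 2^{-J+j+1}$ used above, solves the defining equation for $j^+$ in closed form, and returns the three quantities being compared at the integer scale $j^*$; the function name and parameter grid are ours.

```python
import math

def lemma_2_1_quantities(alpha, beta, J):
    """Return (2*beta*w^alpha, t, 16*beta*w^alpha) at the integer scale j* = ceil(j+)."""
    # 2*beta*2^(-alpha*j) = 2^(-J+j+1)  <=>  j = (J + log2(beta)) / (1 + alpha)
    j_plus = (J + math.log2(beta)) / (1.0 + alpha)
    j_star = math.ceil(j_plus)
    w = 2.0 ** (-j_star)          # w(j*) = 2^{-j*}
    t = 2.0 ** (-J + j_star + 1)  # t(j*) = 2^{-J+j*+1}
    return 2 * beta * w ** alpha, t, 16 * beta * w ** alpha

for alpha in (1.01, 1.5, 2.0):
    lower, t, upper = lemma_2_1_quantities(alpha, beta=1.0, J=12)
    print(alpha, lower <= t <= upper)
```

Equality on the left occurs exactly when $j^+$ is already an integer, which is why a small floating-point tolerance is advisable in automated checks.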


PROOF OF LEMMA 2.2. Remember that $f \in$ Hölder$(\alpha, \beta)$ satisfies $f : [0, 1] \to [0, 1]$ and

$$|f'(x) - f'(y)| \le \alpha\beta |x - y|^{\alpha - 1}.$$

We saw that this implies

$$|f(x) - f(y) - f'(y)(x - y)| \le \beta |x - y|^\alpha, \qquad x, y \in [0, 1]. \tag{A.1}$$

To prove (2.2), we use the notation $f_k(x)$ for the affine function tangent to
graph$(f)$ at $x_k$. Figure 4 illustrates our notation.
Using (A.1), with $I_k$ denoting $[kw, (k+1)w)$,

$$|f(x) - f_k(x)| \le \beta (w/2)^\alpha \le t/4, \qquad x \in I_k.$$

We also note that if $g_k(x) = \ell_1\delta_1 + \ell_2\delta_2(x - x_k)$, then

$$\begin{aligned}
|f_k(x) - g_k(x)| &\le |f(x_k) - \ell_1\delta_1| + |f'(x_k) - \ell_2\delta_2|\,|x - x_k| \\
&\le \delta_1/2 + \delta_2/2 \times w/2 \\
&\le t/8 + t/16.
\end{aligned}$$

We conclude that

$$|f(x) - g_k(x)| \le t/2, \qquad x \in I_k. \tag{A.2}$$

On the other hand, the region $R(j, k, \ell_1, \ell_2)$ has $g_k(x)$ as its midline and is of
half-height $t/2$. The desired relationship (2.2) follows. □


PROOF OF LEMMA 2.3. We use the same notation as in the proof of
Lemma 2.2, as illustrated in Figure 4.

It is enough to show that

$$|g_{k+1}(x_{k+1}) - g_k(x_{k+1})| \le t \tag{A.3}$$

and

$$|g'_{k+1}(x_{k+1}) - g'_k(x_{k+1})| \le t/w. \tag{A.4}$$

It then follows that there is an edge in $\mathcal{E}(j)$ that connects $R(j, k, \ell_1, \ell_2)$ to $R(j, k+1, \ell'_1, \ell'_2)$,
where $\ell'_1$ and $\ell'_2$ are the values associated to $g_{k+1}$.

The following inequalities flow either from Hölder conditions or from simple
rounding involved in quantization:

$$\begin{aligned}
|g_{k+1}(x_{k+1}) - f(x_{k+1})| &\le \delta_1/2 = t/8, \\
|f(x_{k+1}) - f_k(x_{k+1})| &\le \beta w^\alpha \le t/2, \\
|f_k(x_{k+1}) - g_k(x_{k+1})| &\le \delta_1/2 + \delta_2/2 \cdot w = t/4.
\end{aligned}$$

Combining these with the triangle inequality yields (A.3). Similarly,

$$\begin{aligned}
|g'_{k+1}(x_{k+1}) - f'(x_{k+1})| &\le \delta_2/2 = t/(8w), \\
|f'(x_{k+1}) - f'_k(x_{k+1})| &\le \alpha\beta w^{\alpha - 1} \le t/(2w), \\
|f'_k(x_{k+1}) - g'_k(x_{k+1})| &\le \delta_2/2 = t/(8w);
\end{aligned}$$

combining these inequalities gives (A.4). □
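The bookkeeping behind (A.2), (A.3) and (A.4) is a sum of three error terms in each case. A quick check with exact fractions, writing each bound in units of $t$ (or of $t/w$ for the slope bound), confirms that the totals stay within the stated limits; the variable names are ours.

```python
from fractions import Fraction

# (A.2): tangent error + quantization errors, in units of t; must not exceed 1/2.
a2_terms = [Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)]
# (A.3): three position errors at x_{k+1}, in units of t; must not exceed 1.
a3_terms = [Fraction(1, 8), Fraction(1, 2), Fraction(1, 4)]
# (A.4): three slope errors, in units of t/w; must not exceed 1.
a4_terms = [Fraction(1, 8), Fraction(1, 2), Fraction(1, 8)]

print(sum(a2_terms), sum(a3_terms), sum(a4_terms))
```

The totals are $7/16$, $7/8$ and $3/4$, so each triangle-inequality bound holds with some room to spare.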




Acknowledgments. We would like to thank Emmanuel Candès, Hagit Hel-Or,
Aapo Hyvärinen, Jean-Luc Starck and Brian Wandell for helpful discussions
and references.


REFERENCES 


[1] ABRAMOWICZ, H., HORN, D., NAFTALY, U and SAHAR-PIKIELNY, C. (1997). An 
orientation-selective neural network for pattern identification 1n particle detectors. In 
Advances in Neural Information Processing Systems 9 (M. Mozer, M. I Jordan and 
T. Petsche, eds ) 925—931. MIT Press, Cambridge, MA. 

[2] AHO, A. V., HOPCROFT, J. E and ULLMAN, J. D. (1983). Data Structures and Algorithms. 
Addison-Wesley, Reading, MA. MR0666695 

[3] ARIAS-CASTRO, E. (2004). Graphical structures for geometric detection. Ph.D. dissertation, 
Stanford Univ. 

[4] ARIAS-CASTRO, E , DONOHO, D. L. and Huo, X. (2005). Near-optimal detection of geo- 

metric objects by fast multiscale methods. IEEE Trans. Inform. Theory 51 2402—2425. 
ARIAS-CASTRO, E., DONOHO, D. L., HUO, X. and TOVEY, C. (2005). Connect-the-dots: 
How many random points can a regular curve pass through? Adv. in Appl. Probab 37 
571—603. MR2156550 
ARRATIA, R. and WATERMAN, M. S (1989) The Erdos-Rény: strong law for pattern match- 
ing with a given proportion of mismatches. Ann. Probab. 17 1152-1169. MR1009450 
BUHMANN, J M., MALIK, J. and PERONA, P. (1999). Image recognition: Visual grouping, 
recognition, and learning. Proc. Natl. Acad. Sci. USA 96 14,203—14,204. 
[8] COPELAND, A. C., RAVICHANDRAN, G. and TRIVEDI, M. M (1995) Localized Radon 
transform-based detection of ship wakes 1n SAR images. JEEE Trans. Geoscience and 
Remote Sensing 33 35-45 
COURTNEY, S M and UNGERLEIDER, L. G. (1997) What fMRI has taught us about human 
vision. Current Opinion in Neurobiology 7 554—561. 
[10] DAVID, G. and SEMMES, S. (1993). Analysis of and on Uniformly Rectifiable Sets. Amer. 
Math. Soc., Providence, RI. MR1251061 

[11] DESOLNEUX, A., MOISAN, L. and MOREL, J.-M. (2000). Meaningful alignments. Internat.
J. Computer Vision 40 7-23.

[12] DESOLNEUX, A., MOISAN, L. and MOREL, J.-M. (2003). A grouping principle and four
applications. IEEE Trans. Pattern Analysis and Machine Intelligence 25 508-513.

[13] DESOLNEUX, A., MOISAN, L. and MOREL, J.-M. (2003). Maximal meaningful events and
applications to image analysis. Ann. Statist. 31 1822-1851. MR2036391

[14] DONOHO, D. L. (1997). CART and best-ortho-basis: A connection. Ann. Statist. 25
1870-1911. MR1474073

[15] DONOHO, D. L. (1999). Wedgelets: Nearly minimax estimation of edges. Ann. Statist. 27
859-897. MR1724034

[16] DONOHO, D. L. and HUO, X. (2002). Beamlets and multiscale image analysis. In Multiscale
and Multiresolution Methods. Lecture Notes Comput. Sci. Eng. 20 149-196. Springer,
Berlin. MR1928566

[17] DONOHO, D. L. and JOHNSTONE, I. M. (1995). Adapting to unknown smoothness via wavelet
shrinkage. J. Amer. Statist. Assoc. 90 1200-1224. MR1379464

[18] DONOHO, D. L. and LEVI, O. (2004). Fast X-ray and beamlet transforms for three-
dimensional data. In Modern Signal Processing (D. N. Rockmore and D. M. Healy, Jr.,
eds.) 79-116. Cambridge Univ. Press. MR2075950

[19] FIELD, D., HAYES, A. and HESS, R. (1993). Contour integration by the human visual system:
Evidence for a local "association field." Vision Research 33 173-193.




[20] HO, M.-W. (2004). In search of the sublime. Institute of Science in Society. Available at
www.i-sis.org.uk/sublime.php.

[21] HUO, X., CHEN, J. and DONOHO, D. L. (2003). Multiscale detection of filamentary features
in image data. In Wavelets: Applications in Signal and Image Processing X (M. A. Unser,
A. Aldroubi and A. F. Laine, eds.) 592-606. SPIE, Bellingham, WA.

[22] HUO, X., CHEN, J. and DONOHO, D. L. (2003). Multiscale significance run: Realizing the
"most powerful" detection in noisy images. In Proc. Thirty Seventh Asilomar Conference
on Signals, Systems, and Computers 1 321-326. IEEE, Piscataway, NJ.

[23] HUO, X., DONOHO, D. L., TOVEY, C. and ARIAS-CASTRO, E. (2004). Dynamic programming
methods for "connecting the dots" in scattered point sets. Technical report, Dept.
Statistics, Stanford Univ.

[24] JONES, P. W. (1990). Rectifiable sets and the traveling salesman problem. Invent. Math. 102
1-15. MR1069238

[25] KOVACS, I. and JULESZ, B. (1993). A closed curve is much more than an incomplete one:
Effect of closure in figure-ground segmentation. Proc. Natl. Acad. Sci. USA 90 7495-7497.

[26] LEGGE, G. E., KERSTEN, D. and BURGESS, A. E. (1987). Contrast discrimination in noise.
J. Opt. Soc. Amer. A 4 391-404.

[27] LERMAN, G. (2003). Quantifying curvelike structures of measures by using L2 Jones
quantities. Comm. Pure Appl. Math. 56 1294-1365. MR1980856

[28] LEVI, D. M. and KLEIN, S. A. (2000). Seeing circles: What limits shape perception? Vision
Research 40 2329-2339.

[29] MENDOLA, J. D., DALE, A. M., FISCHL, B., LIU, A. K. and TOOTELL, R. B. H. (1999).
The representation of illusory and real contours in human cortical visual areas revealed
by functional magnetic resonance imaging. J. Neuroscience 19 8560-8572.

[30] PIZLO, Z., SALACH-GOLYSKA, M. and ROSENFELD, A. (1997). Curve detection in a noisy
image. Vision Research 37 1217-1241.

[31] QADDOUMI, N., RANU, E., MCCOLSKEY, J. D., MIRSHAHI, R. and ZOUGHI, R. (2000).
Microwave detection of stress-induced fatigue cracks in steel and potential for crack opening
determination. Research in Nondestructive Evaluation 12 87-103.

[32] SHARON, E., BRANDT, A. and BASRI, R. (2000). Fast multiscale image segmentation. In
Proc. IEEE Conference on Computer Vision and Pattern Recognition 1 70-77.

[33] SMALL, C. G. (1996). The Statistical Theory of Shape. Springer, Berlin. MR1418639

[34] TUPIN, F., MAITRE, H., MANGIN, J.-F., NICOLAS, J.-M. and PECHERSKY, E. (1998).
Detection of linear features in SAR images: Application to road network extraction. IEEE
Trans. Geoscience and Remote Sensing 36 434-453.

[35] WERTHEIMER, M. (1938). Laws of Organization in Perceptual Forms. Harcourt Brace,
London.

E. ARIAS-CASTRO
DEPARTMENT OF MATHEMATICS
UNIVERSITY OF CALIFORNIA, SAN DIEGO
9500 GILMAN DRIVE
LA JOLLA, CALIFORNIA 92093-0112
USA
E-MAIL: eariasca@math.ucsd.edu

D. L. DONOHO
DEPARTMENT OF STATISTICS
STANFORD UNIVERSITY
STANFORD, CALIFORNIA 94305-4065
USA
E-MAIL: donoho@stat.stanford.edu

X. HUO
SCHOOL OF INDUSTRIAL
AND SYSTEMS ENGINEERING
GEORGIA INSTITUTE OF TECHNOLOGY
ATLANTA, GEORGIA 30332-0205
USA
E-MAIL: xiaoming@isye.gatech.edu


The Annals of Statistics 

2006, Vol. 34, No. 1, 350-372

DOI: 10.1214/009053605000000750

© Institute of Mathematical Statistics, 2006


OPTIMAL CHANGE-POINT ESTIMATION FROM 
INDIRECT OBSERVATIONS 


BY A. GOLDENSHLUGER,$^1$ A. TSYBAKOV AND A. ZEEVI$^2$
University of Haifa, Université Paris VI and Columbia University 


Dedicated to Boris Polyak on the occasion of his 70th birthday 


We study nonparametric change-point estimation from indirect noisy ob- 
servations. Focusing on the white noise convolution model, we consider two 
classes of functions that are smooth apart from the change-point. We estab- 
lish lower bounds on the minimax risk in estimating the change-point and 
develop rate optimal estimation procedures. The results demonstrate that the 
best achievable rates of convergence are determined both by smoothness of 
the function away from the change-point and by the degree of ill-posedness of 
the convolution operator. Optimality is obtained by introducing a new tech- 
nique that involves, as a key element, detection of zero crossings of an esti- 
mate of the properly smoothed second derivative of the underlying function. 


1. Introduction. In this paper we study the problem of change-point estimation
from indirect and noisy observations. Let $f \in L_2(\mathbb{R})$ denote the unknown
function. Consider the white noise model


(1) $dY(x) = (Kf)(x)\,dx + \varepsilon\,dW(x), \quad x \in \mathbb{R},$


where $W(\cdot)$ is the standard two-sided Wiener process on $\mathbb{R}$, $0 < \varepsilon < 1$, and $K$
is the convolution operator with kernel $K \in L_1(\mathbb{R})$ whose action on a function
$f \in L_2(\mathbb{R})$ is defined by

(2) $(Kf)(x) = \int_{-\infty}^{\infty} K(x - y) f(y)\,dy.$


We assume that $f$ is smooth apart from a jump discontinuity of the first kind at
a point $\theta$ and, without loss of generality, we suppose that $\theta \in [0, 1]$. The problem
is to estimate the change-point $\theta$ based on the observation of a trajectory of the
process $Y(\cdot)$ that satisfies (1).
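
To make the observation scheme concrete, the following is a minimal sketch of a discretized version of model (1), not part of the paper. It assumes a uniform grid, a Riemann-sum approximation of the convolution (2), and independent Gaussian increments with standard deviation $\varepsilon\sqrt{dx}$ in place of $\varepsilon\,dW$. The signal `f`, the constant `THETA` and the function names are hypothetical choices made only for illustration; the kernel is the one-sided exponential that appears later as an example of a regular kernel.

```python
import math
import random

THETA = 0.5  # hypothetical change-point location, chosen for illustration


def f(x):
    """Illustrative signal: a Gaussian bump whose amplitude doubles at THETA."""
    return math.exp(-x * x) * (2.0 if x >= THETA else 1.0)


def kernel(u):
    """One-sided exponential kernel: e^u for u < 0, 1/2 at u = 0, 0 for u > 0."""
    if u < 0:
        return math.exp(u)
    return 0.5 if u == 0 else 0.0


def simulate_increments(f, kernel, eps, x_min=-2.0, x_max=3.0, n=500, seed=0):
    """Return grid midpoints, (Kf) on the grid, and increments
    dY_i = (Kf)(x_i) dx + eps * sqrt(dx) * N(0, 1)."""
    rng = random.Random(seed)
    dx = (x_max - x_min) / n
    xs = [x_min + (i + 0.5) * dx for i in range(n)]
    fx = [f(x) for x in xs]
    # Riemann-sum approximation of (Kf)(x) = int K(x - y) f(y) dy
    kf = [sum(kernel(x - y) * fy for y, fy in zip(xs, fx)) * dx for x in xs]
    dy = [k * dx + eps * math.sqrt(dx) * rng.gauss(0.0, 1.0) for k in kf]
    return xs, kf, dy


if __name__ == "__main__":
    xs, kf, dy = simulate_increments(f, kernel, eps=0.05)
    print(len(dy), max(kf))
```

In this sketch the jump of `f` at `THETA` is no longer visible as a jump in `kf`: the convolution turns it into a kink, which is why the change-point has to be recovered indirectly.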

We study this problem in a minimax framework. Let $\hat\theta$ be an estimator of $\theta$
based on observation of $Y(\cdot)$ that satisfies (1). We measure the accuracy of $\hat\theta$ by


Received July 2004; revised February 2005. 
$^1$Supported by Israel Science Foundation Grant 300/04.
$^2$Supported in part by NSF Grant 04-47652.
AMS 2000 subject classifications. 62G05, 62G20. 
Key words and phrases. Change-point estimation, deconvolution, minimax risk, ill-posedness, 
probe functional, optimal rates of convergence. 




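the maximal risk


$R_\varepsilon[\hat\theta; \mathcal{F}] = \sup_{f \in \mathcal{F}} \bigl(E_f |\hat\theta - \theta|^2\bigr)^{1/2}$


over a class of functions $\mathcal{F}$ that have a single change-point $\theta \in [0, 1]$. Here $E_f$
denotes the expectation with respect to the probability distribution $P_f$ generated
by the model (1) with given $f$. The minimax risk is defined by


$R^*_\varepsilon[\mathcal{F}] = \inf_{\hat\theta} R_\varepsilon[\hat\theta; \mathcal{F}],$


where the infimum is taken over all possible estimators of $\theta$. An estimator $\hat\theta$ is
called rate optimal on the class $\mathcal{F}$ if it satisfies


$R_\varepsilon[\hat\theta; \mathcal{F}] \asymp R^*_\varepsilon[\mathcal{F}]$ as $\varepsilon \to 0$.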


Our aim is to find rate optimal change-point estimators and to establish asymptotics
of minimax risks for some natural classes of functions $\mathcal{F}$ and operators $K$.
Change-points and singularities are intrinsic features of signals that appear in a 
wide variety of applied contexts in economics, medicine and physical science. For 
many types of signals, change-points convey important information about under- 
lying phenomena. For instance, in images, discontinuities of the intensity function 
correspond to the location of the contour of an object that may be particularly im- 
portant for recognition purposes. We refer to the volume by Carlstein, Müller and 
Siegmund [4] for a comprehensive survey of the area and references. The problem 
of nonparametric change-point estimation has been extensively studied in the case 
where the observations of f are direct, that is, when K is the identity operator. 
For such a model, Korostelev [15] constructed a rate optimal estimator of $\theta$ and
derived the optimal rates of convergence. He showed that the minimax risk over
the class of functions that have a single change-point and satisfy the Lipschitz
condition away from the change-point converges to zero at the rate $\varepsilon^2$, which is
faster than the usual parametric rate. (Here and in what follows we have in mind
a standard correspondence between the Gaussian white noise model and discrete
sample models (cf. [3]), given by the calibration $\varepsilon = n^{-1/2}$, where $n$ is the sample
size. With this calibration, the term parametric rate refers to convergence with the
rate $\varepsilon = n^{-1/2}$.) For further work on nonparametric estimation of change-points
and discontinuous functions from direct observations, see, for example, [1, 9, 18, 
21, 22, 27, 25] and the references cited therein. On the other hand, nonparametric 
estimation of a change-point from indirect observations, that is, for operators K 
that are not the identity, is much less studied. Furthermore, the literature contains 
some contradictory statements with regard to best achievable rates of convergence 
and therefore, leaves open the question of how to construct optimal estimators. 
An important result in this area is due to Neumann [19], who investigated the 
problem of change-point estimation from indirect observations in a density
deconvolution model. He assumes that the observations are $Y_i = X_i + \xi_i$, $i = 1, \ldots, n$,




where $X_i$ are i.i.d. random variables with unknown probability density $f$ and
where $\xi_i$ are i.i.d. random errors, independent of the $X_i$'s, with known probability
density $K$. The problem considered by Neumann [19] is to estimate the location $\theta$
of a discontinuity jump in $f$, where this density is assumed to satisfy a Lipschitz
condition away from the change-point. Neumann [19] proved that the order of
the minimax risk in estimating $\theta$ is $\min\{n^{-2/(2\beta+3)}, n^{-1/(2\beta+1)}\}$, provided that the
tails of the characteristic function $\hat K(\omega)$ of $\xi_i$ decrease at the rate $|\omega|^{-\beta}$, $\beta > 0$.
In the nonparametric regression context, Raimondo [21] considered the problem
of estimating a change-point in the $\beta$th derivative of the regression function.
Assuming that this derivative satisfies a Lipschitz condition apart from the
change-point $\theta$, Raimondo [21] claims that the best rate of convergence in estimating $\theta$
is $n^{-1/(2\beta+1)}$. Estimation procedures that achieve this rate were also proposed by
Wang [26] and, more recently, by Huh and Carrière [13] and Park and Kim [20].
Clearly, if $K$ is the Green's function of a linear differential operator of integer
order $\beta$, estimating the change-point $\theta$ of $f$ from indirect observation as in model
(1) is equivalent to estimating the change-point in the derivative of order $\beta$ from
direct observations. This fact indicates that there is a discrepancy between the rates 
of convergence obtained by Neumann [19], on the one hand, and by Raimondo [21] 
and other authors cited above, on the other hand. In particular, the rates obtained 
by Neumann [19] are faster. Although asymptotic equivalence between the two 
indirect observation models (the density model as in [19], and regression/white 
noise model as in [21]) has not been established formally, it would seem natural 
to expect that the rates of convergence are in agreement. In what follows, we will 
show that the "faster" rates of Neumann [19] can indeed be attained and they are 
optimal for the white noise model (1). This fact will be deduced from more general 
results. 

We study the problem of change-point estimation in model (1) for two different
scales of functional classes $\mathcal{F}$ that quantify smoothness of $f$ away from the change-
point. We derive lower bounds on the minimax risk (see Theorems 2 and 4) and
develop rate optimal estimators (see Theorems 1 and 3). In particular, we show that
if $f$ can be represented as the sum of a jump function and a smooth function whose
$m$th derivative exists and is bounded for all $x$, then the minimax risk in estimating $\theta$
is of order $\min\{\varepsilon^{(2m+2)/(2m+2\beta+1)}, \varepsilon^{2/(2\beta+1)}\}$, provided that the tails of the Fourier
transform $\hat K$ of $K$ behave like $|\omega|^{-\beta}$ as $|\omega| \to \infty$, with $\beta > 0$. The elbow in the
rates of convergence corresponds to the cases where $\beta > 1/2$ and $0 < \beta \le 1/2$.
If $\beta > 1/2$, the convolution kernel $K$ belongs to $L_2(\mathbb{R})$. In what follows we call
such convolution kernels and the corresponding setup regular. In contrast, under
$0 < \beta \le 1/2$, the convolution kernel $K$ does not belong to $L_2(\mathbb{R})$. We will call the
latter case singular because it necessarily corresponds to a singular convolution
integral in (2).
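
The elbow between the two regimes can be checked numerically. The sketch below is an illustration added here, not part of the paper: the hypothetical helper `rate_exponent` returns the exponent $r$ for which the minimax risk over the class with smoothness index $m$ is of order $\varepsilon^r$, ignoring logarithmic factors. Since $0 < \varepsilon < 1$, the minimum of two powers of $\varepsilon$ is the power with the larger exponent.

```python
def rate_exponent(m, beta):
    """Exponent r such that the minimax risk is of order eps**r
    (logarithmic factors ignored), following
    min{eps^((2m+2)/(2m+2beta+1)), eps^(2/(2beta+1))}."""
    if m < 1 or beta <= 0:
        raise ValueError("expected m >= 1 and beta > 0")
    regular = (2 * m + 2) / (2 * m + 2 * beta + 1)
    singular = 2 / (2 * beta + 1)
    # For 0 < eps < 1 the smaller of eps**a and eps**b has the larger exponent.
    return max(regular, singular)


if __name__ == "__main__":
    for beta in (0.25, 0.5, 1.0, 2.0):
        r = rate_exponent(1, beta)
        # With eps = n**(-1/2) the risk is of order n**(-r/2).
        print(f"beta={beta}: eps^{r:.4f}, i.e. n^-{r / 2:.4f}")
```

For $m = 1$ and $\beta = 1$ the exponent is $4/5$, which under the calibration $\varepsilon = n^{-1/2}$ gives $n^{-2/5} = n^{-2/(2\beta+3)}$, in agreement with the rate of Neumann [19] quoted above. The two expressions coincide exactly at $\beta = 1/2$.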

We introduce a new estimation technique that involves, as a key element, de- 
tection of zero crossings of an estimate of a properly smoothed second derivative 




of f. This differs from most change-point detection methods described in the sta- 
tistical literature that typically use a properly smoothed first derivative of f. On 
the other hand, our second derivative based approach has parallels in digital im- 
age processing in the context of edge detection, where it is often referred to as the 
Laplacian method (cf. [11]). It is interesting to note that in the regular case seem- 
ingly intuitive procedures based on detecting a maximum in the first derivative 
lead to slower rates of convergence (see further discussion in Section 5). 

The optimal rate of convergence in the regular case, $\varepsilon^{(2m+2)/(2m+2\beta+1)}$, clarifies
how smoothness of $f$ away from the change-point (given by the index $m$) and
ill-posedness of the kernel $K$ (given by $\beta$) affect achievable accuracy in change-point
estimation from indirect observations. The result of Neumann [19] in the density
deconvolution model, with standard calibration $\varepsilon = n^{-1/2}$, can be viewed as the
"density analog" of a special case of our result with $m = 1$, that is, when $f$ is
Lipschitz apart from the change-point. When the "smooth part" of the unknown
function $f$ is analytic, our results show that in the regular case the optimal rate
is $\varepsilon$, up to a logarithmic factor in $\varepsilon^{-1}$, that is, it is nearly the parametric rate.
Interestingly, in this case the ill-posedness index $\beta$ of $K$ appears in the risk bound
only as a power of the logarithmic factor. This means that ill-posedness of $K$ does
not affect significantly the quality of estimation when $f$ is very smooth apart from
the change-point. We also show that in the singular case the optimal rate of
convergence is $\varepsilon^{2/(2\beta+1)}$, up to a logarithmic factor, regardless of the smoothness of $f$
away from the change-point.

Our results elucidate the following important feature of the problem: when esti- 
mating a change-point from indirect data, the best achievable rates of convergence 
depend on the behavior of the function f away from the change-point location. 
This is in striking contrast to the direct observations case, where the rate $\varepsilon^2$ is
the best one can achieve regardless of how many derivatives $f$ possesses apart
from the discontinuity jump. Our results also indicate that the procedure of
Raimondo [21] is not optimal when estimating a change-point of the $\beta$th derivative,
$\beta \ge 1$, in the direct observation model, contrary to what is claimed in that paper
(see further discussion in Section 5). We note that estimating a change-point from 
indirect observations can be done with higher accuracy than curve estimation in 
nonparametric deconvolution (see, e.g., [5-7, 10], and [8]). 

The rest of the paper is organized as follows. Section 2 introduces notation and 
definitions of the functional classes. In Section 3 we construct a probe functional 
that is used for detection of the change-point from indirect observations; some 
properties of the probe functional are discussed, and its estimator is developed. 
Section 4 describes the two-stage change-point estimation procedure and presents
our main results. Section 5 concludes with a discussion of the main results and
Section 6 contains the proofs.




2. Preliminaries. We begin with some notation and definitions. Let $\hat g$ or $(Fg)$
denote the Fourier transform of a function $g \in L_2(\mathbb{R})$; in particular, if $g \in L_1(\mathbb{R}) \cap
L_2(\mathbb{R})$,


$\hat g(\omega) = (Fg)(\omega) := \int_{-\infty}^{\infty} g(x)\, e^{2\pi i \omega x}\,dx, \quad \omega \in \mathbb{R}.$


Let $f(x\pm) = \lim_{t \to x \pm 0} f(t)$ be the one-sided limits of $f$ at point $x$ and let $[f](x) =
f(x+) - f(x-)$ be the local jump function. We say that $\theta \in \mathbb{R}$ is a change-point
of $f$ if $[f](\theta) \ne 0$, and $f(\theta+)$ and $f(\theta-)$ are finite.

We will consider minimax estimation of a change-point $\theta$ of $f$ by assuming
that $f$ belongs to one of the two functional classes, $\mathcal{F}_m$ or $\mathcal{A}_\nu$, defined below.


DEFINITION 1. Let $a, L > 0$ be fixed constants. We say that $f \in \mathcal{F}_1 =
\mathcal{F}_1(a, L)$ if $f \in L_2(\mathbb{R})$ and if $f$ has a single change-point $\theta \in [0, 1]$ such that


$|[f](\theta)| \ge a$ and
$|f(x) - f(x')| \le L|x - x'| \quad \forall x, x' \in \mathbb{R},\ x < x',\ \theta \notin [x, x'].$


The class $\mathcal{F}_1$ contains functions $f$ that have a single jump discontinuity of the
first kind at $\theta \in [0, 1]$ and satisfy the Lipschitz condition on any interval that does
not include $\theta$. This class was considered by Neumann [19] in the context of density
deconvolution. To allow more smoothness of $f$ apart from the jump discontinuity,
we introduce the following extension of $\mathcal{F}_1$.


DEFINITION 2. Let $a, L > 0$ and $m > 1$ be fixed constants. We say that $f \in
\mathcal{F}_m = \mathcal{F}_m(a, L)$ if $f \in L_2(\mathbb{R})$ and if $f$ has a single change-point $\theta \in [0, 1]$ such
that the following conditions hold:


(i) We have $|[f](\theta)| \ge a$.
(ii) For all $x \ne \theta$, $f'(x)$ exists and $[f'](\theta) = 0$, so that the function $g_f : \mathbb{R} \to \mathbb{R}$
defined by


(3) $g_f(x) = f'(x)$ for $x \ne \theta$, $\quad g_f(\theta) = f'(\theta+),$

is continuous.
(iii) The function $g_f$ belongs to $L_2(\mathbb{R})$ and its Fourier transform $\hat g_f$ satisfies


(4) $\int_{-\infty}^{\infty} |\hat g_f(\omega)|\,|\omega|^{m-1}\,d\omega \le L.$


If $m$ is an integer, condition (4) implies that the derivative $g_f^{(m-1)}$ exists and is
bounded by $L$. In fact, (4) is only slightly stronger than this property. For
example, (4) is valid if $g_f$ is in a Sobolev class of $L_2$ smoothness $s > m - 1/2$, that is,
when $\int |\hat g_f(\omega)|^2 |\omega|^{2s}\,d\omega$ is bounded by an appropriate constant. This class is very



close to, but smaller than, the class of functions with bounded derivative $g_f^{(m-1)}$.

It is important that in Definition 2 we have $[f'](\theta) = 0$. If $[f'](\theta) \ne 0$, parts
(ii) and (iii) of Definition 2 cannot be satisfied, but we may still consider classes
of functions $f$ that are smooth separately to the left and to the right of the
change-point. However, introducing such classes seems to be unjustified, because
additional smoothness in these terms does not improve the convergence rate of
estimators of $\theta$. The minimax rate remains the same as for the class $\mathcal{F}_1$.


DEFINITION 3. Let $\nu, a, L > 0$ be fixed constants. We say that $f \in \mathcal{A}_\nu =
\mathcal{A}_\nu(a, L)$ if $f \in L_2(\mathbb{R})$, and if $f$ has a single change-point $\theta \in [0, 1]$ such that
conditions (i) and (ii) of Definition 2 are satisfied and


(5) $\int_{-\infty}^{\infty} |\hat g_f(\omega)|^2 \exp(2\nu|\omega|)\,d\omega \le L^2,$

where $g_f$ is defined in (3).


Assumption (5) implies that $g_f$ is infinitely differentiable and admits an analytical
continuation onto a strip in the complex plane. Such classes of functions have
been studied in the context of nonparametric estimation by many authors, starting
with Ibragimov and Hasminskii [14]. For a recent overview, see, for example, [2].

The following assumption on K will be used throughout this paper. 


ASSUMPTION K. The function $K$ belongs to $L_1(\mathbb{R})$, and there exist constants
$\beta > 0$ (called the ill-posedness index of $K$) and $\underline{\kappa}, \bar\kappa > 0$, such that


(6) $\underline{\kappa}(1 + |\omega|^2)^{-\beta/2} \le |\hat K(\omega)| \le \bar\kappa(1 + |\omega|^2)^{-\beta/2} \quad \forall \omega \in \mathbb{R}.$

Assumption K is quite standard in deconvolution problems and corresponds to
what is known as a moderately ill-posed problem. Green's functions of linear
differential operators are important examples of kernels $K$ satisfying Assumption K
for the regular case ($\beta > 1/2$). For instance, let $v(x) = e^x$, $-\infty < x < 0$, $v(0) =
1/2$ and $v(x) = 0$, $0 < x < \infty$. For nonvanishing real constants $b_j$, $j = 1, \ldots, k$,
we define


(7) $v_j(x) = |b_j|\,v(b_j x)$ and $K = v_1 * v_2 * \cdots * v_k,$


where $*$ stands for the convolution on $\mathbb{R}$. The Fourier transform of the kernel $K$ is
given by $\hat K(\omega) = \prod_{j=1}^{k} (1 + 2\pi b_j^{-1} i\omega)^{-1}$, and Assumption K holds with $\beta = k$.
In this case $f$ can be recovered from $Kf$ by applying the linear differential operator

(8) $f(x) = \prod_{j=1}^{k} \Bigl(1 - b_j^{-1}\frac{d}{dx}\Bigr)(Kf)(x);$

see [12], Chapter II. As for the singular case ($0 < \beta \le 1/2$), examples are more
peculiar; for instance, one may consider $K$ to be the probability density of a gamma
distribution with shape parameter $\beta$.
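
As a quick numerical illustration of the Green's-function example (an added sketch, not part of the paper), the code below evaluates the product formula for $\hat K$ and checks that $|\hat K(\omega)|(1+\omega^2)^{k/2}$ stays between two positive constants, which is what Assumption K requires with $\beta = k$. The function name `k_hat` and the constants $b_j$ are hypothetical. The sign in front of $2\pi i\omega$ depends on the Fourier convention, but the modulus, which is all that (6) involves, does not.

```python
import math


def k_hat(omega, bs):
    """Fourier transform of K = v_1 * ... * v_k from (7), written as the
    product of the factors (1 + 2*pi*i*omega/b_j)^(-1)."""
    out = 1.0 + 0.0j
    for b in bs:
        out /= 1.0 + 2j * math.pi * omega / b
    return out


def normalized_modulus(omega, bs):
    """|K_hat(omega)| * (1 + omega^2)^(k/2); bounded above and below
    by positive constants when Assumption K holds with beta = k."""
    k = len(bs)
    return abs(k_hat(omega, bs)) * (1.0 + omega * omega) ** (k / 2)


if __name__ == "__main__":
    bs = (1.0, 2.0)  # hypothetical nonvanishing constants b_j, so k = 2
    values = [normalized_modulus(w / 10.0, bs) for w in range(-500, 501)]
    print(min(values), max(values))
```

For $b = (1, 2)$ the normalized modulus decreases from 1 at $\omega = 0$ toward $b_1 b_2/(4\pi^2) \approx 0.0507$ as $|\omega| \to \infty$, so the two constants in (6) can be taken close to these values.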




3. Probe functional. We will develop estimation procedures that are based
on minimization of an empirical version of a properly chosen probe functional.
Let $\varphi : \mathbb{R} \to \mathbb{R}$ be an even, twice continuously differentiable function that attains
its global maximum at 0. Further conditions on $\varphi$ will be introduced below. Fix
a bandwidth $h > 0$ and for $t \in \mathbb{R}$, $x \in \mathbb{R}$ define $\psi_t(x) = h^{-3}\varphi''(h^{-1}(x - t))$. Let
$\langle\cdot,\cdot\rangle$ denote the standard inner product in $L_2(\mathbb{R})$. Assuming that $\varphi'' \in L_1(\mathbb{R}) \cap
L_2(\mathbb{R})$, we define the probe functional


(9) $\ell_h(t) = \langle f, \psi_t\rangle = \int_{-\infty}^{\infty} \psi_t(x) f(x)\,dx
= \frac{1}{h^3}\int_{-\infty}^{\infty} f(x)\,\varphi''\Bigl(\frac{x - t}{h}\Bigr)dx, \quad t \in \mathbb{R}.$


The probe functional $\ell_h(t)$ is thus a smoothed second derivative of $f$ at point $t$:
as $h$ tends to zero, $\ell_h(t)$ converges to $f''(t)$, provided that $f$ is twice continuously
differentiable at $t$. The points $t$ where $|\ell_h(t)|$ is close to zero are indicative of the
change-point location; this idea underlies the construction.
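
The behavior of the probe functional at a jump can be seen on the simplest possible input. The sketch below is an added illustration under explicit assumptions: it takes $f(x) = a\,\mathbf{1}\{x \ge \theta\}$ (no smooth part) and a Gaussian $\varphi(u) = e^{-u^2/2}$, which is convenient but does not satisfy the band-limitation required later in Assumption 2. For this $f$, integrating $h^{-3}\varphi''((x-t)/h)$ over $x \ge \theta$ gives the closed form $-a h^{-2}\varphi'((\theta - t)/h)$, which vanishes at $t = \theta$ and changes sign there. All function names are hypothetical.

```python
import math


def dphi(u):
    """First derivative of the Gaussian phi(u) = exp(-u^2 / 2)."""
    return -u * math.exp(-0.5 * u * u)


def d2phi(u):
    """Second derivative of the Gaussian phi(u) = exp(-u^2 / 2)."""
    return (u * u - 1.0) * math.exp(-0.5 * u * u)


def probe_step(t, theta, a, h, n=4000):
    """Midpoint-rule value of h^-3 * int phi''((x - t)/h) f(x) dx
    for the pure step f(x) = a * 1{x >= theta}."""
    upper = max(theta, t) + 10.0 * h  # Gaussian tail beyond this is negligible
    dx = (upper - theta) / n
    total = sum(d2phi((theta + (i + 0.5) * dx - t) / h) for i in range(n))
    return a * total * dx / h ** 3


def probe_step_closed_form(t, theta, a, h):
    """Closed form -a * h^-2 * phi'((theta - t)/h) for the same step function."""
    return -a * dphi((theta - t) / h) / h ** 2


if __name__ == "__main__":
    theta, a, h = 0.5, 1.0, 0.1
    for t in (0.3, 0.45, 0.5, 0.58, 0.7):
        print(t, probe_step(t, theta, a, h), probe_step_closed_form(t, theta, a, h))
```

With a positive jump the values are positive to the left of $\theta$, negative to the right, and zero at $\theta$ itself, which is the zero crossing that the estimation procedure looks for.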

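An estimator of $\ell_h(t)$ based on observations (1) can be developed as follows.
Denote by $K^*$ the adjoint operator to $K$ given by


(10) $(K^*g)(x) = \int_{-\infty}^{\infty} K(y - x)\,g(y)\,dy, \quad g \in L_2(\mathbb{R}).$


By the linear functional strategy, if $\psi_t \in \mathrm{Range}(K^*)$, then there exists a function
$\gamma_t \in L_2(\mathbb{R})$ such that


$\ell_h(t) = \langle f, \psi_t\rangle = \langle f, K^*\gamma_t\rangle = \langle Kf, \gamma_t\rangle.$


The function $\gamma_t$ satisfies $(K^*\gamma_t)(x) = \psi_t(x) = h^{-3}\varphi''(h^{-1}(x - t))$ almost
everywhere in $\mathbb{R}$. Taking the Fourier transforms, and using (10) and the fact that
$\hat\psi_t(\omega) = -(2\pi\omega)^2\hat\varphi(\omega h)e^{2\pi i\omega t}$, we find


$\hat\gamma_t(\omega) = -\frac{(2\pi\omega)^2 e^{2\pi i\omega t}\,\hat\varphi(\omega h)}{\hat K(-\omega)}.$

We will always choose $\varphi$ so that $\gamma_t \in L_1(\mathbb{R}) \cap L_2(\mathbb{R})$. Under this assumption we
may write


$\gamma_t(x) = \int_{-\infty}^{\infty} \hat\gamma_t(\omega)e^{-2\pi i\omega x}\,d\omega
= -\int_{-\infty}^{\infty} (2\pi\omega)^2 e^{2\pi i\omega(t - x)}\,\frac{\hat\varphi(\omega h)}{\hat K(-\omega)}\,d\omega$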


and 
OO CO 
w= fnd] 0000004. 
OQ —00 
Based on these considerations, we define the estimator $\hat\ell_h(t)$ of $\ell_h(t)$ by


$\hat\ell_h(t) = \int_{-\infty}^{\infty} \gamma_t(x)\,dY(x), \quad t \in \mathbb{R}.$
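
Because the estimator is a Wiener integral of the deterministic function $\gamma_t$, its stochastic part has variance $\varepsilon^2\int|\gamma_t(x)|^2dx = \varepsilon^2\int|\hat\gamma_t(\omega)|^2d\omega$, which does not depend on $t$. The sketch below, added for illustration, evaluates this integral numerically and checks how it scales with the bandwidth. It assumes the one-sided exponential kernel with $b_1 = 1$ (so $\beta = 1$) and a hypothetical smooth bump `phi_hat` supported on $1/3 \le |w| \le 2/3$; this bump is only a stand-in and does not take the value 1 on an inner interval.

```python
import math


def phi_hat(w):
    """Hypothetical smooth bump supported on 1/3 <= |w| <= 2/3, peak value 1."""
    a = abs(w)
    if a <= 1 / 3 or a >= 2 / 3:
        return 0.0
    return math.exp(36.0 - 1.0 / ((a - 1 / 3) * (2 / 3 - a)))


def k_hat_abs2(omega, bs=(1.0,)):
    """|K_hat(omega)|^2 for the Green's-function kernel with constants bs."""
    out = 1.0
    for b in bs:
        out /= 1.0 + (2.0 * math.pi * omega / b) ** 2
    return out


def noise_variance(h, eps, bs=(1.0,), n=2000):
    """eps^2 * int (2*pi*omega)^4 |phi_hat(omega*h)|^2 / |K_hat(omega)|^2 d omega,
    computed by the midpoint rule in the variable w = omega * h."""
    lo, hi = 1 / 3, 2 / 3
    dw = (hi - lo) / n
    total = 0.0
    for i in range(n):
        w = lo + (i + 0.5) * dw
        omega = w / h
        total += (2.0 * math.pi * omega) ** 4 * phi_hat(w) ** 2 / k_hat_abs2(omega, bs)
    # factor 2: the integrand is even in omega; d omega = dw / h
    return eps ** 2 * 2.0 * total * dw / h


if __name__ == "__main__":
    v1 = noise_variance(0.01, 1.0)
    v2 = noise_variance(0.005, 1.0)
    print(v2 / v1)  # close to 2 ** (2 * beta + 5) with beta = 1
```

Halving the bandwidth multiplies the variance by roughly $2^{2\beta+5} = 128$ in this example, which is the $h^{-2\beta-5}$ growth that drives the stochastic error of the probe functional estimator.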




Properties of estimation procedures that we develop are determined crucially
by (i) accuracy of the probe functional estimator $\hat\ell_h(t)$ and (ii) the ability of the
probe functional to detect the change-point. The former is quantified by the next
lemma, which we prove under the following assumption on the smoothing
function $\varphi$.


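ASSUMPTION 1. The Fourier transform $\hat\varphi$ of $\varphi \in L_1(\mathbb{R})$ satisfies

$\int_{-\infty}^{\infty} |\omega|^{2\beta+6}\,|\hat\varphi(\omega)|^2\,d\omega < \infty;$

that is, $\varphi$ belongs to the Sobolev space with smoothness index $\beta + 3$.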


LEMMA 1. Let Assumption 1 and the left inequality in Assumption K hold,
and let $h > 0$. Then the Gaussian random process $Z_h(t) = \hat\ell_h(t) - \ell_h(t)$, $t \in B$,
where $B$ is a subinterval of $[0, 1]$, satisfies


(11) $E[Z_h(t)] = 0, \qquad \sup_{t \in B} E[Z_h^2(t)] \le C_1\varepsilon^2 h^{-2\beta-5}.$


Furthermore, for any X. > 207 
B+3/2 








1242845 
JiB exp | - 575 — |. 


where $|B|$ stands for the Lebesgue measure of the set $B$ and $C_i$, $i = 1, 2, 3$, are
positive constants.


(12) P| sup Z(t >a] < o,(* 
teB 


Further results will be obtained under the following condition which is stronger 
than Assumption 1. 


ASSUMPTION 2. The Fourier transform $\hat\varphi$ of $\varphi \in L_2(\mathbb{R})$ is an even,
nonnegative, infinitely differentiable function supported on $[-2/3, -1/3] \cup [1/3, 2/3]$
and taking the value 1 on $[-2/3 + \eta, -1/3 - \eta] \cup [1/3 + \eta, 2/3 - \eta]$ for some
$\eta \in (0, 1/32)$.


It follows from Assumption 2 that $\varphi$ is a real-valued, even, analytic function,
rapidly decreasing at infinity, together with all its derivatives. In addition, since
$\hat\varphi$ is nonnegative, $\varphi$ achieves its global maximum at $x = 0$, $\varphi'(0) = 0$ and $|\varphi''(0)| \ge
M > 0$ for some constant $M$. The Meyer wavelet (see, e.g., [17], Section 7.2.2),
centered at zero and rescaled accordingly, provides an example of a function that
satisfies Assumption 2.

We summarize some properties of $\varphi'$ that will be repeatedly used in what
follows.


(I) For all $x \in \mathbb{R}$, $\varphi'(0) = 0$ and $\varphi'(x) = -\varphi'(-x)$.




(II) The function $\varphi'$ decreases in $[0, 3/8]$, increases in $[3/4, 9/8]$ and has
a unique minimum in $[3/8, 3/4]$, which is the point of the global minimum,
$x = q_* \in [3/8, 3/4]$. By (I), $\varphi'$ attains its global maximum at $x = -q_* \in
[-3/4, -3/8]$.
(III) There exists a unique zero of $\varphi'$ in the interval $[3/4, 3/2]$. We denote it by $q_0$
and let $d = |q_0 - q_*| > 0$. We have that $\varphi' \le 0$ on $[0, q_0]$ and
13 x)— 1zr-0 
(13) . LP (x) — 9'(44)) = 


for some constant r. 


Proofs of (I)-(III) are immediate and based on analysis of the integrand sign in
the expressions for $\varphi'$ and $\varphi''$. Parameters $q_*$, $q_0$ and $d$ that appear in (II) and (III)
depend on the specific choice of $\varphi$; condition (13) asserts that $q_*$ is a well-separated
point of the global minimum of $\varphi'$.

The next lemma analyzes the separation between values of the probe functional
$\ell_h(t)$ when $t$ varies in a "punctured" neighborhood of the change-point $\theta$.


LEMMA 2 (Separation rate). Let Assumption 2 hold and let $\delta \in (0, \bar q h)$, where
$\bar q = q_* + 3d/4$, and the constants $q_*$ and $d$ are given in (II) and (III).


1. Let $f \in \mathcal{F}_m$ and let


(14) $\delta \ge C_1(L/a)\,h^{m+1}$

for an absolute constant $C_1 > 0$ large enough. Then for sufficiently small $h$,

(15) $\inf_{t:\ \delta \le |t - \theta| \le \bar q h} |\ell_h(t)| - |\ell_h(\theta)| \ge C_2\, a\,\delta\, h^{-3},$


where $C_2$ is a positive constant that depends only on $a$, $L$ and $q$.
2. Let $f \in \mathcal{A}_\nu$ and let


(16) ô > C3(L/a)hexp(—v/QR)] 


for an absolute constant $C_3 > 0$ large enough. Then (15) holds for sufficiently
small $h$.


The value that appears on the right-hand side (RHS) of (15) will be called the
$\delta$-separation rate that corresponds to the probe functional $\ell_h$. Lemma 2 asserts
that $\theta$ is a well-separated point of minimum of $|\ell_h(t)|$, provided that $h$ and $\delta$ satisfy
(14) and (16) for $f \in \mathcal{F}_m$ and $f \in \mathcal{A}_\nu$, respectively. Conditions (14) and (16) are
required to guarantee that the bias terms do not exceed the contrast expressed by
the $\delta$-separation rate. They also show that if $f \in \mathcal{A}_\nu$, the value $\delta$ can be chosen
much smaller than in the case of $f \in \mathcal{F}_m$; that is, the minimum of $|\ell_h(t)|$ is more
pronounced when $f \in \mathcal{A}_\nu$. It is interesting to note that if a smoothed first derivative
of $f$ is used as the probe functional and the maximum is sought, the corresponding
$\delta$-separation rate is of order $\delta^2 h^{-3}$. As our proofs suggest, in the regular case this
choice of the probe functional does not lead to a rate optimal estimation procedure
(see Section 5).




4. Estimation procedure and main results. We are now in position to
define the estimation procedure. The construction has two stages: First we localize
the region that contains the change-point with probability close to 1 and then we
search for a minimum of the absolute value of the probe functional inside the
region.

The localization step is based on the following argument. As the proof of
Lemma 2 shows, $\ell_h(t)$ is equal to $-h^{-2}\varphi'(h^{-1}(\theta - t))[f](\theta)$ up to a term
that is negligible, provided that $|\theta - t| = O(h)$. It follows from (II) that

$t_* := \arg\min_{t \in [0,1]} \bigl(-\varphi'((\theta - t)h^{-1})[f](\theta)\bigr)$ and
$t^* := \arg\max_{t \in [0,1]} \bigl(-\varphi'((\theta - t)h^{-1})[f](\theta)\bigr)$

are within the distance $O(h)$ from $\theta$:


$|t_* - \theta| = q_* h \in [3h/8, 3h/4], \qquad |t^* - \theta| = q_* h \in [3h/8, 3h/4].$


In addition, $|t_* - t^*| = 2q_* h \in [3h/4, 3h/2]$. If $[f](\theta) < 0$, then $t^* > t_*$ and $[t_*, t^*]$
contains $\theta$; if $[f](\theta) > 0$, then $t^* < t_*$ and $\theta \in [t^*, t_*]$. Both $t_*$ and $t^*$ can be
estimated from the data. This fact will be used to find an interval of size $O(h)$ that
contains the change-point with probability close to 1.

Let


(17) $\hat t_* := \arg\min_{t \in [0,1]} \hat\ell_h(t), \qquad \hat t^* := \arg\max_{t \in [0,1]} \hat\ell_h(t),$


and let $A_h$ be the closed interval with endpoints $\hat t_*$ and $\hat t^*$. Our estimator $\hat\theta_h$ of the
change-point for given bandwidth $h$ is defined by


(18) $\hat\theta_h = \arg\min_{t \in A_h} |\hat\ell_h(t)|.$


We note that this construction depends on the bandwidth $h$ that will be chosen in
an optimal way.
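
The two stages in (17) and (18) translate directly into a search over a grid of probe values. The following sketch is an added illustration, not the authors' implementation: the function `two_stage_estimate` takes precomputed values of the (estimated) probe functional on a grid, and the example feeds it the noise-free leading term $-a h^{-2}\varphi'((\theta - t)/h)$ with a Gaussian $\varphi$, which is an assumption made for simplicity and does not satisfy Assumption 2.

```python
import math


def jump_part_probe(t, theta, a, h):
    """Leading term -a * h^-2 * phi'((theta - t)/h) of the probe functional,
    with the Gaussian phi(u) = exp(-u^2 / 2) used as a stand-in."""
    u = (theta - t) / h
    dphi = -u * math.exp(-0.5 * u * u)
    return -a * dphi / h ** 2


def two_stage_estimate(ts, probe):
    """Stage 1: locate argmin and argmax of the probe values, as in (17).
    Stage 2: minimize |probe| on the interval between them, as in (18)."""
    idx = range(len(ts))
    i_min = min(idx, key=lambda i: probe[i])
    i_max = max(idx, key=lambda i: probe[i])
    lo, hi = sorted((i_min, i_max))
    i_hat = min(range(lo, hi + 1), key=lambda i: abs(probe[i]))
    return ts[i_hat], (ts[lo], ts[hi])


if __name__ == "__main__":
    ts = [i / 1000 for i in range(1001)]
    probe = [jump_part_probe(t, theta=0.4, a=1.0, h=0.05) for t in ts]
    estimate, interval = two_stage_estimate(ts, probe)
    print(estimate, interval)
```

The localization stage matters because $|\ell_h(t)|$ is also close to zero far from the change-point, where $f$ is smooth; restricting the search to the interval between the extrema excludes those spurious near-zeros. With the Gaussian stand-in the extrema sit at $\theta \pm h$, so the interval has length $2h$.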
4.1. Functional class $\mathcal{F}_m$.


THEOREM 1. Suppose that the left inequality in (6) is satisfied and that
Assumption 2 holds. Let $\hat\theta_*$ denote the change-point estimator $\hat\theta_h$ with the bandwidth
$h = h_*$ defined below.


1. Regular case: Assume that $\beta > 1/2$ and let


where $C_1^* > 0$ is a constant. Then there exists a constant $C_1 < \infty$ independent of
$a$, $L$ such that




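2. Singular case: Assume that $0 < \beta \le 1/2$ and let


(20) $h_* = C_2^*\Bigl(\varepsilon\sqrt{\ln\frac{1}{\varepsilon}}\Bigr)^{2/(2\beta+1)},$


where $C_2^* > 0$ is a sufficiently large constant. Then there exists a constant $C_2 < \infty$
such that


$R_\varepsilon[\hat\theta_*; \mathcal{F}_m] \le C_2\,\varepsilon^{2/(2\beta+1)}\Bigl(\ln\frac{1}{\varepsilon}\Bigr)^{(1-2\beta)/(2(2\beta+1))} \quad \forall\, 0 < \varepsilon < 1.$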


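THEOREM 2. Let the right-hand side inequality in (6) hold and assume that
$|\hat K(\omega)| \ne 0$ for all $\omega$. Then, for sufficiently small $\varepsilon$, the minimax risk over the class
$\mathcal{F}_m$ is bounded from below as follows:


1. Regular case: If $\beta > 1/2$, then

(21) $R^*_\varepsilon[\mathcal{F}_m] \ge c_1\, a^{-1} L^{(2\beta-1)/(2\beta+2m+1)}\,\varepsilon^{(2m+2)/(2m+2\beta+1)},$


where $c_1$ does not depend on $a$, $L$.
2. Singular case: If $0 < \beta \le 1/2$, then


(22) $R^*_\varepsilon[\mathcal{F}_m] \ge c_2\,\varepsilon\Bigl(\ln\frac{1}{\varepsilon}\Bigr)^{-1/2}$ if $\beta = 1/2$, and
$R^*_\varepsilon[\mathcal{F}_m] \ge c_3\,\varepsilon^{2/(2\beta+1)}$ if $0 < \beta < 1/2$,


where $c_2, c_3 > 0$ are constants.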


Theorems 1 and 2 show that the estimate $\hat\theta_*$ is rate optimal in the regular case.
Moreover, the bounds identify the precise dependence of the minimax risk on the
size of the jump $a$ and on the "Lipschitz constant" $L$ away from the jump (see
definition of the class $\mathcal{F}_m$). As smoothness of $f$ away from the jump increases
(i.e., as $m \to \infty$), the optimal rate of convergence approaches the usual parametric
rate $\varepsilon$. In the singular case, $\hat\theta_*$ is nearly rate optimal up to a factor logarithmic
in $\varepsilon^{-1}$. Here the order of the rate of convergence is faster than the parametric
rate $\varepsilon$, but slower than $\varepsilon^2$, the minimax rate achieved in change-point estimation
under direct observations. It follows from Lemma 3 in Section 6 that the estimators
$\hat t_*$ and $\hat t^*$ defined in (17) and associated with the bandwidth (20) are also nearly
rate optimal in the singular case. We conjecture that for $0 < \beta < 1/2$ the extra
logarithmic factor in the bound of Theorem 1 can be removed and thus $\varepsilon^{2/(2\beta+1)}$
is the optimal rate of convergence for such values of $\beta$.


4.2. Functional class $\mathcal{A}_\nu$.


THEOREM 3. Suppose that the left-hand side inequality in (6) is satisfied and
that Assumption 2 holds. Let $\hat\theta_*$ denote the change-point estimator $\hat\theta_h$ with the
bandwidth $h = h_*$ defined below.




1. Regular case: Assume that $\beta > 1/2$ and let
ví L 1 |e oe 


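Then there exists a constant $C_3 < \infty$ independent of $\nu$, $a$, $L$ such that

$R_\varepsilon[\hat\theta_*; \mathcal{A}_\nu] \le C_3\, a^{-1}\varepsilon\Bigl(\frac{1}{\nu}\ln\frac{L}{\varepsilon}\Bigr)^{\beta-1/2} \quad \forall\, 0 < \varepsilon < 1.$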


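2. Singular case: Assume that $0 < \beta \le 1/2$ and let, for some sufficiently large
$C_4^* > 0$,


(24) $h_* = C_4^*\Bigl(\varepsilon\sqrt{\ln\frac{1}{\varepsilon}}\Bigr)^{2/(2\beta+1)}.$


Then there exists a constant $C_4 < \infty$ such that


$R_\varepsilon[\hat\theta_*; \mathcal{A}_\nu] \le C_4\,\varepsilon^{2/(2\beta+1)}\Bigl(\ln\frac{1}{\varepsilon}\Bigr)^{(1-2\beta)/(2(2\beta+1))} \quad \forall\, 0 < \varepsilon < 1.$
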

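THEOREM 4. Let the right-hand side inequality in (6) hold. Then for
sufficiently small $\varepsilon$ the minimax risk over the class $\mathcal{A}_\nu$ is bounded from below as
follows:


1. Regular case: If $\beta > 1/2$, then


(25) $R^*_\varepsilon[\mathcal{A}_\nu] \ge c_4\, a^{-1}\varepsilon\Bigl(\frac{1}{\nu}\ln\frac{L}{\varepsilon}\Bigr)^{\beta-1/2},$

where $c_4$ is a constant independent of $\nu$, $a$, $L$.
2. Singular case: If $0 < \beta \le 1/2$, then

(26) $R^*_\varepsilon[\mathcal{A}_\nu] \ge c_5\,\varepsilon\Bigl(\ln\frac{1}{\varepsilon}\Bigr)^{-1/2}$ if $\beta = 1/2$, and
$R^*_\varepsilon[\mathcal{A}_\nu] \ge c_6\,\varepsilon^{2/(2\beta+1)}$ if $0 < \beta < 1/2$,


where $c_5, c_6 > 0$ are constants independent of $\nu$.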


Theorems 3 and 4 indicate that $\hat\theta_*$ is rate optimal in the regular case, and nearly
rate optimal in the singular case. It is interesting to note that in the regular case,
when $f \in \mathcal{A}_\nu$, almost parametric rates of convergence are attained by our
estimation procedure. The ill-posedness of the convolution operator $K$, as expressed
by index $\beta$, does not have a significant effect on the rates of convergence when
$f \in \mathcal{A}_\nu$; this fact is rather surprising.




5. Discussion. 1. Our technique elucidates how the construction of the probe
functional affects estimation accuracy. An appropriate probe functional, $\ell_h$, should
satisfy the following two requirements: (i) $\theta$ is a well-separated point of minimum
(maximum) of $\ell_h(\cdot)$ or $|\ell_h(\cdot)|$; (ii) $\ell_h(t)$ admits a "good" estimator with "small"
bias and variance. The proofs suggest that in the regular case the optimal rates of
convergence are obtained by balancing three quantities: the $\delta$-separation rate, the
bias and the stochastic error of estimation of a properly chosen probe functional.
In the singular case the bias is asymptotically negligible and the optimal rates
are obtained by balancing only two terms: the $\delta$-separation rate and the stochastic
error. As an illustration, consider the regular case and estimation on
classes $F_m$. For our functional $\ell_h$, the $\delta$-separation rate in (15) is of order $\delta h^{-3}$,
the stochastic error is characterized by the square root of the variance
$\mathrm{Var}[\hat\ell_h(t)] = O(\varepsilon^2 h^{-2\beta-5})$ [see (11)] and the bias term is of order $h^{m-2}$ [see (30) and (32)].
The balance between the three terms is given by the relationships
$\delta h^{-3} \asymp h^{m-2} \asymp \varepsilon h^{-\beta-5/2}$. Solving for this, we get the optimal bandwidth
$h \asymp \varepsilon^{2/(2m+2\beta+1)}$ and the corresponding optimal rate $\delta \asymp \varepsilon^{(2m+2)/(2m+2\beta+1)}$.
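
As a worked check of the balance described in this remark, the short derivation below solves the three relations. It takes the separation rate, bias and stochastic error to be of orders $\delta h^{-3}$, $h^{m-2}$ and $\varepsilon h^{-\beta-5/2}$, as read from (15), (30), (32) and (11); constants and the dependence on $a$ and $L$ are suppressed, which is a simplifying assumption of this sketch.

```latex
\begin{align*}
h^{m-2} \asymp \varepsilon h^{-\beta-5/2}
  &\iff h^{m+\beta+1/2} \asymp \varepsilon
  \iff h \asymp \varepsilon^{2/(2m+2\beta+1)},\\
\delta h^{-3} \asymp h^{m-2}
  &\iff \delta \asymp h^{m+1}
  \asymp \varepsilon^{(2m+2)/(2m+2\beta+1)}.
\end{align*}
```

For $m = 1$ the exponent becomes $4/(2\beta+3)$, which agrees with the rate quoted in item 3 of this section.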

2. The proposed estimator is based on a local search for a zero of the smoothed
second derivative of f. An alternative and seemingly natural approach would be
to estimate $\theta$ by searching for a maximum of a smoothed first derivative of f, that
is, to consider the probe functional $w_h(t) = h^{-2}\int f(x)\varphi'(h^{-1}(x - t))\,dx$. This,
however, does not lead to rate optimal estimation in the regular case. Although the
stochastic error of the corresponding estimator $\hat w_h(t)$ is smaller than that of $\hat\ell_h(t)$
[$\mathrm{Var}[\hat w_h(t)] = O(\varepsilon^2 h^{-2\beta-3})$], the $\delta$-separation rate is of order $\delta^2 h^{-3}$. The bias
term is now of order $h^{m-1}$; this follows from similar arguments as in the proof
of Lemma 2. By balancing the three terms (bias, stochastic error and $\delta$-separation
rate), as explained in the previous remark in this section, it is not difficult to verify
that the estimator based on a local maximization of $|\hat w_h(t)|$ has risk of order
$\varepsilon^{(m+2)/(2m+2\beta+1)}$ when $f \in F_m$ and $\beta > 1/2$. We recall that the optimal rate of
convergence given by Theorem 1 is faster, $\varepsilon^{(2m+2)/(2m+2\beta+1)}$.
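
The comparison of the two rates can be made concrete with a small calculation. The sketch below is only an illustration: the function name `rate_exponents` is hypothetical, and it simply evaluates the two exponents of $\varepsilon$ quoted in this remark, $(2m+2)/(2m+2\beta+1)$ for the second-derivative functional and $(m+2)/(2m+2\beta+1)$ for the first-derivative functional, in exact rational arithmetic.

```python
from fractions import Fraction


def rate_exponents(m, beta):
    """Return the exponents of epsilon for the two probe functionals.

    First value: zero of the smoothed second derivative (the optimal rate).
    Second value: maximum of the smoothed first derivative.
    Both formulas are the ones quoted in the text for the regular case.
    """
    beta = Fraction(beta)
    denom = 2 * m + 2 * beta + 1
    second_derivative = Fraction(2 * m + 2) / denom
    first_derivative = Fraction(m + 2) / denom
    return second_derivative, first_derivative


if __name__ == "__main__":
    # A larger exponent means a faster rate, since epsilon tends to zero.
    for m, beta in [(1, 1), (2, 1), (3, Fraction(3, 2))]:
        opt, alt = rate_exponents(m, beta)
        print(f"m={m}, beta={beta}: optimal {opt}, first-derivative {alt}")
```

Since $2m + 2 > m + 2$ for every $m \ge 1$, the first exponent is always the larger one, so the gap between the two procedures does not close for any smoothness level.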

3. The results of the present paper cover the problem of estimating change-points
in the $\beta$th derivative of a function from direct observations. In particular,
assume that $\beta$ is an integer and let K be the Green's function of a linear differential
operator of order $\beta$ as defined in (7). Denoting $q = Kf$, we note that estimating
the change-point in f from indirect observations (1) is equivalent to estimating
the change-point in $q^{(\beta)}$ from direct observation of q in the white noise model.
Indeed, in view of the inversion formula (8), if f has a change-point at $\theta$, then
$q^{(\beta)}$ (or any linear differential form of q of order $\beta$) has a change-point at $\theta$ as
well. Therefore, regarding the observations of q in the white noise model as indirect
observations of f, with K being the Green's function of a linear differential
operator, we can apply our procedure to estimation of change-points in $q^{(\beta)}$. In
particular, according to Theorems 1 and 2, if $q^{(\beta)}$ satisfies the Lipschitz condition
away from the change-point (i.e., m = 1), the best achievable rate of convergence
is $\varepsilon^{4/(2\beta+3)}$, which can be easily extended to the regression problem with equidistant
design, where the rate becomes $n^{-2/(2\beta+3)}$. This indicates that change-point
estimation procedures in [21], as well as in [13, 20, 26], are not rate optimal, contrary
to what is claimed in some of these papers. Specifically, as a referee pointed
out, the lower bound of Raimondo [21] is not correct. At the same time, our results
are consistent with those obtained for density deconvolution by Neumann [19],
who only considered the case of Lipschitz smoothness (m = 1).


6. Proofs and auxiliary results. In what follows C, c, $C_i$ and $c_i$, i = 1, 2, ...,
stand for positive constants that may differ on different occurrences.


PROOF OF LEMMA 1. Assumptions 1 and K imply that $\psi_t \in L_2(\mathbb{R})$. Since
$f \in L_2(\mathbb{R})$ and $K \in L_1(\mathbb{R})$ we have that $Kf \in L_2(\mathbb{R})$; thus the Fourier transform
$\widehat{Kf}$ exists and $\widehat{Kf} = \hat K\hat f$. Using (1) and Plancherel's formula we get, for any $t \in \mathbb{R}$,


EBO f E y GYKf)G)dx 
= [WOR WFO) do 


=- [^ Qxoyei"och)fo)do 


B f. Flo) f (v) do = (f, We) = Er Ct), 


which proves that $E[Z_h(t)] = 0$. Thus $Z_h(t)$ is a zero mean Gaussian random
variable with variance


E[Z2(1— E (e f.» awe) | 
=8 [motas 


"mE T us 4 |@(wh) |* 
= ae o EC 





e? 
aces | OPC lo? ao 
< eeh 265, 


where we have used Assumptions K and 1. This proves (11). 
To prove (12) we apply the inequality on the tails of Gaussian processes (see, 
e.g., [24], Proposition A.2.7). For this purpose, using Plancherel’s formula and 




Assumptions K and 1 we obtain, for t, s € [0, 1], 


o^ (t,s) € E|Z(t) — Z()f? 


= e | Iy (x) — ys)" dx 


oo ay 2 
_ zl Oot LEOR amor _ amies 2 Je 
—00 |K(—) |? 


OO 
< cse? |t — s|? | lel S1 + o |@(oh) do 


Oxo 
saeh PT sp J (1 + loD 319(oy?? do 
—0Q 


< c4g&h ?P— t == sui. 


Therefore, the number of balls of radius r in the seminorm o (t, s) that cover 
the interval B C [0, 1] does not exceed csr ^ ! eh? 7/?| B], and applying Proposi- 
tion A.2.7 of [24] (putting in the notation of that proposition £p = oz, 
K = eh- £7" |B]), we get the lemma. O 


PROOF OF LEMMA 2. 1. Fix t satisfying $\delta < |t - \theta| < qh$ and define
$\tau = (\theta - t)/h$; clearly $q > |\tau| > \delta/h$. By (9),


1 f? ,fx-—t ] 6? ,fíx—t 
0-5] e (5 reas e s f o"( n )reods 


Lope 1 (o 
=f "I efeexmass zs | lO fü e xh) dx. 


First assume that m = 1. Let 


c I : H 
108 f _ e" GOLFG x) — f(0—)]dx 





(27) 


1 OQ 

ex] vUa + xh) — OH) dz. 
Then using Definition 1 and the fact that $\varphi'(-\infty) = \varphi'(\infty) = 0$, we obtain

l ] 

£y (1) = 759 (0f (6— E 770 OSOH T JA) 
(28) 
I 
= -339 OLIO) + A). 


Recall that $\varphi'(0) = 0$ and $|\varphi''(0)| \ge M > 0$. In addition, by (I)-(III), $\varphi'(x)$ has
a unique zero in the interval $[-q, q]$ at x = 0. Therefore, $|\varphi'(\tau)| \ge c_1|\tau|$ for all
$|\tau| \in (\delta/h, q)$ and, for h small enough, we get


1 
(29) POOE cah ?|v| > csa8h ^? . 




Furthermore, we note that $\ell_h(\theta) = J_1(\theta)$ and that, for all t satisfying
$\delta < |t - \theta| < qh$,

(30)    $|J_1(t)| \le \frac{L}{h}\int_{-\infty}^{\infty} |\varphi''(x)|\,|\tau - x|\,dx \le c_4 L h^{-1}$.


Using (14) we conclude that for sufficiently small h, the sign of $\ell_h(t)$ is determined
by the first term in (28). Therefore, (15) holds for m = 1.
If m > 1, then integrating by parts in both integrals on the RHS of (27) and
using the fact that $\varphi'(-\infty) = \varphi'(\infty) = 0$, we obtain
1 
(31) £y (t) = -339 IFI) — Jo(t) 
with 
1 OO 
nZ- f ear xh)dx 
=00 


1 oo s 
E 2l (—2zio)9()e ^" ^2 5 (—w/h) do, 
OO 


where $g_f$ is defined in (3) and the last equality follows from the Plancherel formula.
In view of Assumption 2 and Definition 2,
Po f” 

Jo(t)| < — 
PADS h? 1/3<|o|<2/3 |v|" ^7 J—co 
This along with (14), (29), (31) and the fact that $\ell_h(\theta) = -J_2(\theta)$ completes the
proof of the first statement of the lemma.

2. If $f \in A_\nu$, then $\ell_h(t)$ is again given by (31) and $J_2(t)$ is now bounded using
the Cauchy-Schwarz inequality:


|Js (t san? J 





(32) lol 1 [8 7 (@/h)| de x csLh™™. 


-" 1/2 
OPIPA) expl -2vlol} do 
OO 


i 1 [£r (I^ exp{2v]æ]} do ‘ 


1/2 
< «e| f lof expl -2vlol] do} 
1/(35) x|o| x2/ (3A) 


< c; Lh * exp{—v/(3h)}. 
The same considerations as above complete the proof. □

LEMMA 3. Let Assumption 1 and the left inequality in Assumption K hold,
and let $h^{\beta+1/2}\varepsilon^{-1} \ge C_1$. Then, for $f \in F_m$ or $f \in A_\nu$ and for all h small enough,

        $\max\bigl\{P_f\{|\hat t_* - t_*| > hd/2\},\ P_f\{|\hat t^* - t^*| > hd/2\}\bigr\} \le C_2 h^{\beta+1/2}\varepsilon^{-1}\exp\bigl\{-C_3 h^{2\beta+1}/\varepsilon^2\bigr\}$,

where d > 0 is given in (III).


PROOF. We will derive the inequality for $P_f\{|\hat t_* - t_*| > hd/2\}$ only; the proof
of the other part is identical in every detail. Define $A = \{t \in [0, 1] : |t - t_*| > hd/2\}$.
By definition of $t_*$ and $\hat t_*$ we have


7 -fl> hd /2) < < Pr(3t € A: la(te) = £x (t) 


=P {3t eA: [En (te) — la(te)] + Lea) — £n (E) 
> LaCt) — Ln (te)} 


<P; 2 sup (és) — OI > in (As — TON 


tefo, 


It follows from (28) and (31) that 


] 0 —t1t 0 —t, 
£40) aed = 310 | (5) - e (—*)| 10 - 762. 
where, J (t) E JLE) if f € Fy and J(r) 8 —J5(t) if f € £4, m> lor f € Ay, 
with Ji (t) and J5 (t) as defined in the proof of Lemma 2. Therefore, we obtain 


inf f (£h (t) — £n(te)) 
ted 


Bin a agito (x5) -(522)] eeo 


! i C 

EOC O 
where the last inequality follows by change of variables, by definition of $t_*$ and
because $\varphi'(x) = -\varphi'(-x)$. Using property (III), we obtain that the first term on the
RHS of (33) is at least $c_1 a h^{-2}$, while the second one does not exceed in absolute
value $c_2 L h^{m-2}$ if $f \in F_m$ and $L h^{-2}\exp\{-\nu/(3h)\}$ if $f \in A_\nu$ (see the proof of
Lemma 2, where upper bounds on $|J(t)|$ were established). Noting that $h^{-2}$ dominates
both $h^{m-2}$ and $h^{-2}\exp\{-\nu/(3h)\}$ as h tends to zero and applying Lemma 1,
we obtain that


Pelli — t| > hd/2} < P;| sup |£q(t) — £a(t)| > ei? | 
t €[0,1] 





h28+! | 


< cshP- 1? 6! exp|-cs 5 
€ 


as claimed. □

PROOF OF THEOREM 1. 1. We begin with the regular case. The choice of
$h = h_*$ in (19) implies that $\varepsilon^{-1}h_*^{\beta+1/2} \asymp \varepsilon^{-m/(m+\beta+1/2)}$, so Lemma 3 can be applied.
Let $\Omega$ be the event that $|\hat t_* - t_*| \le hd/2$ and $|\hat t^* - t^*| \le hd/2$. Recall that




$|t_* - \theta| = |t^* - \theta| = q_* h$, where $q_*$ is defined in (II). Therefore, on the set $\Omega$,

(34)    $|\hat t_* - \theta| \le q_* h + hd/2 \le qh$ and $|\hat t^* - \theta| \le q_* h + hd/2 \le qh$,

where q is defined in Lemma 2. Recall that, by property (III), $\varphi'(x)$ has a unique
zero in the interval $[-q, q]$ at the point x = 0. This guarantees that if $\Omega$ holds, the
set $A_h$ contains a unique zero of the function $t \mapsto \varphi'(h^{-1}(\theta - t))$ at $t = \theta$ and thus
the definition (18) is justified.

We write 


Eó, — 9? =E {lr — 6/1(Q)) + Er(10, — OPLO 


(35) a 
< Ej (IU, — 61(0)) +P), 


where 1(-) denotes the indicator function. By Lemma 3 we have 
IP(Q2^) = IP((If — ts] > hd/2) U (f* — t*| > h4/2)) 
(36) < cıhft!/2e7! exp{—czhPt! e ne 


—mj/(m-4-p--1/2) cL 


< C3E exp{—c4s~ 


Furthermore, when $2 holds, it follows from (34) and from the construction of Ôh 
that |6, — 0| < gh. Thus for any ô € (0, gh), the first term on the RHS of (35) can 
be bounded as 


J | "M 
(37) . Er(lü, — OPR) x 8^ + Y 8°27 Pr (16, — 81 € ^5) n Q), 
J=] 


where Aj = aod" l oe unu ee min(j :62/ > gh}. Let Tj er end }; 
we note that IT, | = 627— ' Then we have 


Pida —0] € ^,) n Q) m 
(38) <P (at ET iO = len!) —- HEIL 


< P; [2 sup £s) — £41 > inf (enol wo). 
teT, 
We first estimate infer, (Ea (H) | — |£5(80)]) using Lemma 2. Note that Lemma 2 
can be applied with (t: 8 < |t — 0| < gh} replaced by T, for each j = 1,..., J, 
provided that 8277 “lh > e5(L L/a)h"**l. In pom we Rage 

inf (|£& (t)| — |£,(0 £y(t)| — {2 6 

inf ( n(t)| — |£,(@)|) = wit n "m MI n(t)| — |£n( n) 
(39) p 

> cgaóh 2/, ES 


Let 
(40) $= c7(L/ayh™*} E cg, C8 D/QG8-2m-1) 5-1 (g^) 0 -D0/OB-E2m1) 




It is straightforward to verify that with this choice of $\delta$ and h for sufficiently
large $c_7$, conditions of Lemma 2 are satisfied. In addition, $2^j a\delta h^{-3} \ge c_9\varepsilon h^{-\beta-5/2}$
for some constant $c_9$ and each j = 1, ..., J. Therefore, using (15) and applying
Lemma 1, we obtain


P, | sup |Z (t)| > cgaóh 727 | 


teT, 


ppP+3/2 8h 732) )2 2845 
< cio( g Jiriiasn-on epf- | 





e) 4 E4312 





0282 26-1 | 


agt 4-7 exp| —c1327 zi 


< C12 


pb-3/2 
ad*27/ exp{—c1427/ ), 





* C12 


where we have taken into account that 2h2P-!e—* > c > 0 under (19) and (40). 
Note also that c14 can be made large enough by choice of cg in (40). Furthermore, 
since hP-3/2525—1 — p 2m1/25-1 — gm/(B+m+1/2) — o(1) as e > 0, we finally 
obtain from (41), (38) and (37) that 


J 
E (I9 — 61*1(02)) < 8? + 8^o(1) 9 2% exp{—c142} 
gel 


< &^(14- o(1)). 


Combining this with (35) and (36), we complete the proof of the first part of the 
theorem. 

2. For the singular case, the proof follows the same lines with minor modifications;
we indicate them below.

The choice of h in (20) ensures that $h^{\beta+1/2}\varepsilon^{-1} = c_{15}\sqrt{\ln \varepsilon^{-1}}$ so that Lemma 3
applies. In addition, by choice of Cj large enough, P(Q*°) = o(h^*") for any 
n > 0 and £ — 0. Arguing as in the proof of the first part, we see that inequali- 
ties (37)-(39) hold. For some constant c16 > 0, let 


(42)    $\delta = c_{16}\,\varepsilon h^{-\beta+1/2} = c_{17}\,\varepsilon^{2/(2\beta+1)}\bigl(\ln(1/\varepsilon)\bigr)^{(1-2\beta)/(2(2\beta+1))}$.

Under this choice, 


inf (ex — |£4(0)1) = cigóh ?2/ 


> cygh 979/221. Jeleni: 




The last inequality ensures that Lemma 1 can be applied and, similarly to (41), we 
have 


p—5/2 





8222! j?8-1 | 


h 
P; | sup Za (t)] = c1g5h 72! < c29 22 


teT, 


8^2?! exp | —€5| 


< 0225 ee?” exp{—c232}. 
Substituting expression (20) for h and summing up over j = 1, ..., J, we complete
the proof. □


PROOF OF THEOREM 2. We use the method of proving minimax lower
bounds based on a reduction to the problem of testing two simple hypotheses;
see, for example, [16], Chapter 2, or [23], Chapter 2.

Pick $f_0 \in F_m(a, L/2)$ such that $f_0$ has a unique jump discontinuity at $\theta_0 = 0$
and $[f_0](0) = a$. Fix $\delta \in (0, 1]$ and define $v(x) = a\mathbf{1}_{[0,\delta]}(x)$, $x \in \mathbb{R}$. The Fourier
transform of v is given by


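        $\hat v(\omega) = a\int_0^{\delta} e^{2\pi i\omega x}\,dx = \frac{a}{2\pi i\omega}\bigl[e^{2\pi i\omega\delta} - 1\bigr]$.

Fix N > 0 and define

        $v_N(x) = \int_{-N}^{N} \hat v(\omega)e^{-2\pi i\omega x}\,d\omega$,   $x \in \mathbb{R}$.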


The Fourier transform of this function is $\hat v_N(\omega) = \hat v(\omega)\mathbf{1}(|\omega| \le N)$. Let

        $f_1(x) = f_0(x) - [v(x) - v_N(x)]$,   $x \in \mathbb{R}$.


The function $x \mapsto f_0(x) - v(x)$ has a unique jump at $x = \delta$ with $[f_0 - v](\delta) = -a$,
while $v_N$ is infinitely differentiable. Therefore, $x \mapsto f_1(x)$ has a unique jump at
$x = \delta$ with $[f_1](\delta) = -a$. Set $\theta_1 = \delta$, where the index 1 indicates that $\theta_1$ is the
change-point of $f_1$.

We now show that $f_1 \in F_m(a, L)$ under appropriate choice of N. First, clearly
$f_1 \in L_2(\mathbb{R})$, since $f_0, v, v_N \in L_2(\mathbb{R})$. Next, the Fourier transform of the derivative
$v_N'(x)$ is given by $(\mathcal{F}v_N')(\omega) = (-2\pi i\omega)\hat v_N(\omega) = (-2\pi i\omega)\hat v(\omega)\mathbf{1}(|\omega| \le N)$ and


OO j i OO 
| KFv Xlll" do = 20 | iow (cw) lo" deo 
Ds p 


N 
(43) =a | Prison dog met dos 
—N 


« Ama Qm 
m+] 
In what follows choose N = (ete yl/ (m+!) Then the expression in (43) is less 
than L/2. 




First let m = 1. Then (43) implies that the derivative $|v_N'(x)|$ is uniformly in
$x \in \mathbb{R}$ bounded by L/2 and thus $v_N$ is Lipschitz continuous with Lipschitz constant
L/2 on $\mathbb{R}$. Also, $f_0 - v$ has this property apart from $\theta_1 = \delta$. Hence, $f_1$ is Lipschitz
continuous with Lipschitz constant L apart from $\theta_1 = \delta$, which proves that
$f_1 \in F_1(a, L)$.

Now let m > 1. Then gf = gf, + vy, with gy defined in (3) and 


oo 
[ig Dll" do < L2, 
—0o 
since fo € Fm (a, L/2). This and (43) prove that, under our choice of N, 


oo 
[iG Colo" ao x L. 
—OO 


We have thus shown that fi € Fm(a, L). 

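For brevity, let $P_0$ and $P_1$ denote the probability measures associated with the
observations $Y = \{Y(x) : x \in \mathbb{R}\}$ in model (1) with $f = f_0$ and $f = f_1$, respectively.
In view of Girsanov's formula, the Kullback-Leibler divergence between
$P_0$ and $P_1$ has the form

        $\mathcal{K}(P_0, P_1) = \int \ln\frac{dP_0}{dP_1}\,dP_0 = \frac{1}{2\varepsilon^2}\|K(f_0 - f_1)\|_2^2$.

The function $\Delta = f_0 - f_1 = v - v_N$ belongs to $L_2(\mathbb{R})$ and its Fourier transform is
given by $\hat v(\omega)\mathbf{1}(|\omega| > N)$. Since $K \in L_1(\mathbb{R})$, $K\Delta$ exists and $\widehat{K\Delta} = \hat K\hat\Delta$. Hence,
by Plancherel's formula,

(44)    $\mathcal{K}(P_0, P_1) = \frac{1}{2\varepsilon^2}\int_{|\omega| > N} |\hat K(\omega)|^2|\hat v(\omega)|^2\,d\omega$.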


Assume first that 8 > 1/2. Then 


(a D 
K (Po, P1) < ci 2——N ^? 
_ M a i i 


where we have used Assumption K and the fact that $|\hat v(\omega)| \le a\delta$ for all $\omega \in \mathbb{R}$.
Choosing
8g 1 LO6-D/QOfBo2m41) 2(m41)/Qf--2m-H1) 


we ensure that $\mathcal{K}(P_0, P_1) \le \alpha < \infty$ for $\varepsilon$ small enough. On the other hand,
$|\theta_0 - \theta_1| = \delta$ and it follows from part (ii) of Theorem 2.2 in [23] that
$\sup_{f \in F_m} E_f|\hat\theta - \theta|^2 \ge c_3\delta^2$. This completes the proof of (21) for $\beta > 1/2$.

Now let $0 < \beta \le 1/2$. We decompose the domain of integration in (44) into two
parts: $N < |\omega| \le N'$ and $|\omega| > N'$, where $N' = 1/\delta$. For $N < |\omega| \le N'$ we bound
the integrand as above, while for $|\omega| > N'$ we use the fact that $|\hat v(\omega)| \le a(\pi|\omega|)^{-1}$.
This yields


8? d 1 d 
(45) K(Po, P) < «(7 | 2 +a ess) 


N «|o|N' |o|?8 o» N' |o |?*?8 
Hence, for B = 1/2 we obtain 


82 , 1 


and the choice of $\delta \asymp \varepsilon(\ln(1/\varepsilon))^{-1/2}$ allows us to conclude the proof using the same
argument as in the case of $\beta > 1/2$. Finally, for $0 < \beta < 1/2$, we get from (45) that


5*(N’)!~2B 1 
K (Po, P1) < e( —— aan) 


and the choice of $\delta \asymp \varepsilon^{2/(2\beta+1)}$ yields the boundedness of the last expression and
hence the desired result. □


The proofs of Theorems 3 and 4 follow the same steps as the proofs of Theorems
1 and 2 with slight modifications and are omitted.


REFERENCES 


[1] ANTONIADIS, A. and GIJBELS, I. (2002). Detecting abrupt changes by wavelet methods.
J. Nonparametr. Statist. 14 7-29. MR1905582
[2] BELITSER, E. and LEVIT, B. (2001). Asymptotically local minimax estimation of infinitely
smooth density with censored data. Ann. Inst. Statist. Math. 53 289-306. MR1841137
[3] BROWN, L. D. and LOW, M. G. (1996). Asymptotic equivalence of nonparametric regression
and white noise. Ann. Statist. 24 2384-2398. MR1425958
[4] CARLSTEIN, E., MÜLLER, H.-G. and SIEGMUND, D., eds. (1994). Change-Point Problems.
IMS, Hayward, CA. MR1477909
[5] CARROLL, R. J. and HALL, P. (1988). Optimal rates of convergence for deconvolving a
density. J. Amer. Statist. Assoc. 83 1184-1186. MR0997599
[6] CAVALIER, L. and TSYBAKOV, A. B. (2002). Sharp adaptation for inverse problems with
random noise. Probab. Theory Related Fields 123 323-354. MR1918537
[7] FAN, J. (1991). On the optimal rates of convergence for nonparametric deconvolution
problems. Ann. Statist. 19 1257-1272. MR1126324
[8] FAN, J. and KOO, J.-Y. (2002). Wavelet deconvolution. IEEE Trans. Inform. Theory 48
734-747. MR1889978
[9] GIJBELS, I., HALL, P. and KNEIP, A. (1999). On the estimation of jump points in smooth
curves. Ann. Inst. Statist. Math. 51 231-251. MR1707773
[10] GOLDENSHLUGER, A. (1999). On pointwise adaptive nonparametric deconvolution. Bernoulli
5 907-925. MR1715444
[11] GONZALEZ, R. C. and WOODS, R. E. (1992). Digital Image Processing. Addison-Wesley,
Reading, MA.
[12] HIRSCHMAN, I. I. and WIDDER, D. V. (1955). The Convolution Transform. Princeton Univ.
Press. MR0073746
[13] HUH, J. and CARRIERE, K. C. (2002). Estimation of regression functions with a discontinuity
in a derivative with local polynomial fits. Statist. Probab. Lett. 56 329-343. MR1892994
[14] IBRAGIMOV, I. A. and HASMINSKII, R. Z. (1983). Estimation of distribution density. J. Soviet
Math. 25 40-57.
[15] KOROSTELEV, A. P. (1987). Minimax estimation of a discontinuous signal. Theory Probab.
Appl. 32 727-730. MR0927265
[16] KOROSTELEV, A. P. and TSYBAKOV, A. B. (1993). Minimax Theory of Image Reconstruction.
Lecture Notes in Statist. 82. Springer, New York. MR1226450
[17] MALLAT, S. (1998). A Wavelet Tour of Signal Processing. Academic Press, San Diego, CA.
MR1614527
[18] MÜLLER, H.-G. (1992). Change-points in nonparametric regression analysis. Ann. Statist. 20
737-761. MR1165590
[19] NEUMANN, M. H. (1997). Optimal change-point estimation in inverse problems. Scand. J.
Statist. 24 503-521. MR1615339
[20] PARK, C.-W. and KIM, W.-C. (2004). Estimation of a regression function with a sharp change
point using boundary wavelets. Statist. Probab. Lett. 66 435-448. MR2045137
[21] RAIMONDO, M. (1998). Minimax estimation of sharp change points. Ann. Statist. 26
1379-1397. MR1647673
[22] SPOKOINY, V. (1998). Estimation of a function with discontinuities via local polynomial fit
with an adaptive window choice. Ann. Statist. 26 1356-1378. MR1647669
[23] TSYBAKOV, A. B. (2004). Introduction à l'estimation non-paramétrique. Springer, Berlin.
MR2013911
[24] VAN DER VAART, A. and WELLNER, J. (1996). Weak Convergence and Empirical Processes.
With Applications to Statistics. Springer, New York. MR1385671
[25] WANG, Y. (1995). Jump and sharp cusp detection by wavelets. Biometrika 82 385-397.
MR1354236
[26] WANG, Y. (1999). Change-points via wavelets for indirect data. Statist. Sinica 9 103-117.
MR1678883
[27] YIN, Y. Q. (1988). Detection of the number, locations and magnitudes of jumps. Comm. Statist.
Stochastic Models 4 445-455. MR0971600

A. GOLDENSHLUGER
DEPARTMENT OF STATISTICS
UNIVERSITY OF HAIFA
HAIFA 31905
ISRAEL
E-MAIL: goldensh@stat.haifa.ac.il

A. TSYBAKOV
LABORATOIRE DE PROBABILITÉS ET MODÈLES ALÉATOIRES
UNIVERSITÉ PARIS VI
4 PLACE JUSSIEU
PARIS 75252
FRANCE
E-MAIL: tsybakov@ccr.jussieu.fr

A. ZEEVI
GRADUATE SCHOOL OF BUSINESS
COLUMBIA UNIVERSITY
3022 BROADWAY
NEW YORK 10027
USA
E-MAIL: assaf@gsb.columbia.edu


The Annals of Statistics
2006, Vol. 34, No. 1, 373-393
DOI 10.1214/009053605000000741
© Institute of Mathematical Statistics, 2006


ESTIMATING THE PROPORTION OF FALSE NULL 
HYPOTHESES AMONG A LARGE NUMBER OF 
INDEPENDENTLY TESTED HYPOTHESES 


BY NICOLAI MEINSHAUSEN AND JOHN RICE 
ETH Zürich and University of California, Berkeley 


We consider the problem of estimating the number of false null hypotheses
among a very large number of independently tested hypotheses, focusing
on the situation in which the proportion of false null hypotheses is very small.
We propose a family of methods for establishing lower $100(1 - \alpha)\%$ confidence
bounds for this proportion, based on the empirical distribution of the
p-values of the tests. Methods in this family are then compared in terms of
ability to consistently estimate the proportion by letting $\alpha \to 0$ as the number
of hypothesis tests increases and the proportion decreases. This work is
motivated by a signal detection problem that occurs in astronomy.


1. Introduction. An example that motivated our work is afforded by the 
Taiwanese—American Occultation Survey (TAOS), which we now briefly describe. 
The TAOS will attempt to detect small objects in the Kuiper Belt, a region of the 
solar system beyond the orbit of Neptune. The Kuiper Belt contains an unknown 
number of objects (KBOs), most of which are believed to be so small that they
do not reflect enough light back to Earth to be directly observed. The purpose of 
the TAOS project is to estimate the number of these KBOs down to the typical 
size of cometary nuclei (a few kilometers) by observing occultations. The idea 
of the occultation technique is simple to describe. One monitors the light from a 
collection of stars that have angular sizes smaller than the expected angular sizes 
of comets. An occultation is manifested by detecting the partial or total reduction 
in the flux from one of the stars for a brief interval when an object in the Kuiper 
Belt passes between it and the observer. Four dedicated robotic telescopes will
automatically monitor 2000-3000 stars every clear night for several years and their
combined results will be used to test for an occultation of each star approximately
every 0.20 seconds, yielding on the order of $10^{11}$ tests per year. The number of
occultations expected per year ranges from tens to a few thousands, depending on 
what model of the Kuiper Belt is used. Having conducted a large number of tests, 
it is then of interest to estimate the number of occultations, or the occultation rate, 
since this will provide information on the distribution of KBOs. Note that in this 
context we are not so much interested in which particular null hypotheses are false 


Received October 2004; revised March 2005. 
AMS 2000 subject classifications. Primary 62H15; secondary 62J15, 62P35.
Key words and phrases. Hypothesis testing, multiple comparisons, sparsity.




as in how many are. The TAOS project was further described by Liang et al. [8] 
and Chen et al. [3]. 

We will base our analysis on the distribution of the p-values of the hypothesis
tests. Let $\{G_\theta, \theta \in \Theta\}$ be some family of distributions, where $\Theta$ is possibly
infinite-dimensional and $G_0(t) = t$ with $0 \in \Theta$ is the uniform distribution on [0, 1]. All
p-values are assumed to be independently distributed according to

        $P_i \sim G_{\theta_i}$,   $i = 1, \ldots, n$.

If a null hypothesis is true, the distribution of its p-value is uniform on [0, 1] and
$P_i \sim G_0$. We suppose that neither the family $\{G_\theta(t), \theta \in \Theta\}$ nor the parameter vector
$(\theta_1, \ldots, \theta_n)$ is known, except for the fact that $G_0$ corresponds to the uniform
distribution.

The proportion of null hypotheses that are false (the fraction of occultations in
the TAOS example) is denoted by

(1)    $\lambda = n^{-1}\sum_{i=1}^{n} \mathbf{1}\{\theta_i \ne 0\}$.

Our goal is to construct a lower bound $\hat\lambda$ with the property

(2)    $P(\hat\lambda \le \lambda) \ge 1 - \alpha$

for a specified confidence level $1 - \alpha$. Such a lower bound would allow one to assert,
with a specified level of confidence, that the proportion of false null hypotheses
is at least $\hat\lambda$. The global null hypothesis that there are no false null hypotheses
can be tested at level $\alpha$ by rejecting when $\hat\lambda > 0$.

Our construction is closely related to that by Meinshausen and Bühlmann [9], 
which treats the case of possibly dependent tests, but with an observational struc- 
ture that allows the use of permutation arguments that are not available in our case. 
Another estimate was examined by Nettleton and Hwang [10], but it does not have 
a property like (2). Our methodology is related to that of controlling the false
discovery rate [1, 13], but the goals are different: we are not so much interested in
which particular hypotheses are false as in how many are. However, we note that
an estimate of the number of the false null hypotheses can be usefully employed
in adaptive control of the false discovery rate [2]. In a modification of the original
FDR method, Storey [13] also estimated the proportion of false hypotheses. The
empirical distribution of p-values was used by Schweder and Spjøtvoll [11] to
estimate the number of true null hypotheses; the methods used there are different
than ours and do not provide explicit lower confidence bounds. The methods in
this paper extend a proposal of Genovese and Wasserman [7]. We also relate our 
results to those of Donoho and Jin [6]. 




2. Theory and methodology. The estimate hinges on the definition of bounding
functions and bounding sequences.

Let U be uniform on [0, 1]. Let $U_n(t)$ be the empirical cumulative distribution
function of n independent realizations of a random variable with distribution
U. For any real-valued function $\delta(t)$ on [0, 1] which is strictly positive on (0, 1),
define $V_{n,\delta}$ as the supremum of the weighted empirical distribution

(3)    $V_{n,\delta} = \sup_{t \in (0,1)} \frac{U_n(t) - t}{\delta(t)}$.

DEFINITION 1. A bounding function $\delta(t)$ is any real-valued function on [0, 1]
that is strictly positive on (0, 1). A series $\beta_{n,\alpha}$ is called a bounding sequence for a
bounding function $\delta(t)$ if, for a constant level $\alpha$:

(a) $n\beta_{n,\alpha}$ is monotonically increasing with n;
(b) $P(V_{n,\delta} > \beta_{n,\alpha}) < \alpha$ for all n.


The definition of a bounding sequence depends neither on the unknown proportion
of false null hypotheses nor on the unknown distribution G(t) of p-values
under the alternative.

One is interested in the case where a proportion $\lambda$ of all hypotheses are false
null hypotheses. Denote the empirical distribution of p-values by

(4)    $F_n(t) = n^{-1}\sum_{i=1}^{n} \mathbf{1}\{P_i \le t\}$.

Estimating the proportion of false null hypotheses can be achieved by bounding the
maximal contribution of true null hypotheses to the empirical distribution function
of p-values. We give a brief motivation. Suppose for a moment that there are only
true null hypotheses. The expected fraction of p-values less than or equal to some
$t \in (0, 1)$ equals, in this scenario, $U(t) = t$. The realized fraction $U_n(t)$ is, on the
other hand, frequently larger than t. However, using Definition 1, the probability
that $U_n(t)$ is larger than $t + \beta_{n,\alpha}\delta(t)$ is bounded by $\alpha$ simultaneously for all values
of $t \in (0, 1)$. The proportion of p-values in the given sample that are in excess of
the bound $t + \beta_{n,\alpha}\delta(t)$ can thus be attributed to the existence of a corresponding
proportion of false null hypotheses and $F_n(t) - t - \beta_{n,\alpha}\delta(t)$ is hence a low-biased
estimate of $\lambda$. As the bound for the contribution of true null hypotheses holds
simultaneously for all values of $t \in (0, 1)$, a lower bound for $\lambda$ is obtained by
taking the supremum of $F_n(t) - t - \beta_{n,\alpha}\delta(t)$ over the interval (0, 1). A refined
analysis shows that an additional factor $1/(1 - t)$ can be gained when estimating
the proportion of false null hypotheses.


DEFINITION 2. Let $\beta_{n,\alpha}$ be a bounding sequence for $\delta(t)$ at level $\alpha$. An estimate
for the proportion $\lambda$ of false null hypotheses is given by

(5)    $\hat\lambda = \sup_{t \in (0,1)} \frac{F_n(t) - t - \beta_{n,\alpha}\delta(t)}{1 - t}$.


This estimate is indeed a lower bound for $\lambda$, as shown in the following theorem.

THEOREM 1. Let $\beta_{n,\alpha}$ be a bounding sequence for $\delta(t)$ at level $\alpha$ and let $\hat\lambda$ be
defined by (5). Then

(6)    $P(\hat\lambda \le \lambda) \ge 1 - \alpha$.

PROOF. The distribution of p-values $F_n$ is bounded by $F_n(t) \le \lambda + (1 - \lambda)U_{n_0}(t)$,
where $n_0 = (1 - \lambda)n$ and $U_{n_0}(t)$ is the empirical distribution of $n_0$ independent
Uniform(0, 1)-distributed random variables. Thus

(7)    $P(\hat\lambda > \lambda) = P\Bigl(\sup_{t \in (0,1)} \frac{F_n(t) - t - \beta_{n,\alpha}\delta(t)}{1 - t} > \lambda\Bigr)$

(8)    $\le P\Bigl(\sup_{t \in (0,1)} (1 - \lambda)(U_{n_0}(t) - t) - \beta_{n,\alpha}\delta(t) > 0\Bigr)$

(9)    $= P\Bigl(\sup_{t \in (0,1)} U_{n_0}(t) - t - \frac{n}{n_0}\beta_{n,\alpha}\delta(t) > 0\Bigr)$.

As $n\beta_{n,\alpha}$ is monotonically increasing, $n\beta_{n,\alpha}/n_0 \ge \beta_{n_0,\alpha}$ and the proof follows
by property (b) in Definition 1. □
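
To make the estimate in Definition 2 concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `lambda_hat` and its arguments are illustrative, NumPy is assumed, and the supremum over $t \in (0, 1)$ is approximated by a maximum over the observed p-values. That approximation can only make the result smaller, so the output remains a valid, slightly more conservative, lower bound. Clipping at zero is a further assumption of the sketch, harmless because $\lambda \ge 0$.

```python
import numpy as np


def lambda_hat(pvalues, beta, delta=lambda t: np.sqrt(t * (1.0 - t))):
    """Approximate lower bound for the proportion of false null hypotheses.

    pvalues : 1-d array of p-values
    beta    : value of the bounding sequence for this n and level
    delta   : bounding function (default: sqrt(t(1-t)))
    """
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    inside = (p > 0.0) & (p < 1.0)            # stay in the open interval (0, 1)
    t = p[inside]
    if t.size == 0:
        return 0.0
    # With ties, i/n at the last tied index equals F_n(t); the maximum picks it up.
    F = (np.arange(1, n + 1) / n)[inside]
    values = (F - t - beta * delta(t)) / (1.0 - t)
    return max(0.0, float(values.max()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: 90% uniform p-values, 10% very small ones.
    p = np.concatenate([rng.uniform(size=9000), rng.uniform(0, 1e-4, size=1000)])
    print(lambda_hat(p, beta=0.05))
```

The value of `beta` has to come from a bounding sequence appropriate for the chosen bounding function; the value 0.05 in the usage lines is a placeholder.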


2.1. Asymptotic control. Instead of finite-sample control, it is sometimes more
convenient to resort to asymptotic control. A sequence $\beta_{n,\alpha}$ is said to be an
asymptotic bounding sequence if $\beta_{n,\alpha}$ satisfies condition (a) from Definition 1 and,
additionally, a modified condition (b'),

(10)    $\limsup_{n \to \infty} P(V_{n,\delta} > \beta_{n,\alpha}) < \alpha$,

where $V_{n,\delta}$ is defined as in (3). If we suppose that the absolute number of false
null hypotheses $n\lambda$ is growing with n, that is, $n\lambda \to \infty$ for $n \to \infty$, then for an
asymptotic bounding sequence,

        $\limsup_{n \to \infty} P(\hat\lambda \le \lambda) \ge 1 - \alpha$.

Asymptotic control is typically useful in the following situation. For a given
bounding function $\delta(t)$ and two sequences $a_n$, $b_n$, consider weak convergence of

(11)    $a_n V_{n,\delta} - b_n$

to a distribution L. Any sequence $\beta_{n,\alpha}$ that satisfies the monotonicity condition (a)
of Definition 1 and, additionally, $\beta_{n,\alpha} \ge a_n^{-1}(L^{-1}(1 - \alpha) + b_n)$, is thus an
asymptotic bounding sequence at level $\alpha$.

As an important example, consider the bounding function $\delta(t) = \sqrt{t(1 - t)}$.
The following lemma is due to Jaeschke and can be found in [12], page 599,
Theorem 1 (18).

LEMMA 1. Let $a_n = \sqrt{2n\log\log n}$ and $b_n = 2\log\log n + \frac{1}{2}\log\log\log n - \frac{1}{2}\log 4\pi$. Then

(12)    $a_n \sup_{t \in (0,1)} \frac{U_n(t) - t}{\sqrt{t(1 - t)}} - b_n \xrightarrow{\ \mathcal{D}\ } E$,

where E is the Gumbel distribution $E(x) = \exp(-\exp(-x))$.


REMARK 1. The convergence in (12) is in general slow. Nevertheless, the
result is of interest here. First, the number of tested hypotheses is potentially very
large (e.g., $10^{12}$ in the TAOS setting described in the Introduction). Moreover,
the slow convergence is mainly caused by values of t that are of order 1/n. The
expected value of the smallest p-value of true null hypotheses is at least 1/n and it
might be useful to truncate in practice the range over which the supremum is taken
in (5) to (1/n, 1 - 1/n). Doing so, the following asymptotic results are still valid,
while the approximation by the Gumbel distribution is empirically a good fit even
for moderate values of n [6].


Similar weak convergence results for other bounding functions can be found in
[4] or [12].


2.2. Bounding functions. The estimate is determined by the choice of the
function $\delta(t)$, the so-called bounding function, and a suitable bounding sequence.

There are many conceivable bounding functions. Bounding functions of particular
interest include:

- linear bounding function $\delta(t) = t$;
- constant bounding function $\delta(t) = 1$;
- standard deviation-proportional bounding function $\delta(t) = \sqrt{t(1 - t)}$.


The linear bounding function is closely related to the false discovery rate (FDR),
as introduced by Benjamini and Hochberg [1]. In the FDR setting, the empirical
distribution of p-values is compared to the linear function $t/\alpha$. The last
downcrossing of the empirical distribution over the line $t/\alpha$ determines the number of
rejections that can be made when controlling FDR at level $\alpha$. It is interesting to
compare this to the current setting. In particular, it follows by a result of Daniels [5]
that, for $a \ge 1$,

$$P\Bigl(\sup_{t \in (0,1)} U_n(t)/t > a\Bigr) = 1/a.$$

The optimal bounding sequence at level $\alpha$ is thus given for the linear bounding
function by $\beta_{n,\alpha} = 1/\alpha - 1$. Let $\hat\lambda$ be the estimate under the linear bounding
function. The estimate hence vanishes, that is, $\hat\lambda = 0$, if and only if no rejections can
be made under FDR control at the same level. Note that the bounding sequence is


378 N. MEINSHAUSEN AND J. RICE 


independent of the number of observations. This leads to weak power to detect the
full proportion $\lambda$ of false null hypotheses when the proportion $\lambda$ is rather high but
the distribution of p-values under the alternative deviates only weakly from the
uniform distribution, as shown in an asymptotic analysis below.
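Daniels' identity for the uniform empirical distribution can be checked by simulation. The following is a minimal sketch in Python (standard library only); the function names are illustrative and not from the paper. It assumes that the supremum of $U_n(t)/t$ is attained at an order statistic, which holds because the ratio decreases between jumps of $U_n$.

```python
import random

def sup_ratio(n, rng):
    """sup over t in (0,1) of U_n(t)/t for n i.i.d. uniform draws."""
    # 1 - random() lies in (0, 1], which avoids a division by zero.
    u = sorted(1.0 - rng.random() for _ in range(n))
    # Between jumps U_n(t)/t decreases, so only the order statistics matter.
    return max((i + 1) / (n * u_i) for i, u_i in enumerate(u))

def exceedance_frequency(a, n=50, reps=20000, seed=0):
    """Monte Carlo estimate of P(sup_t U_n(t)/t > a)."""
    rng = random.Random(seed)
    return sum(sup_ratio(n, rng) > a for _ in range(reps)) / reps

def linear_bounding_sequence(alpha):
    """beta = 1/alpha - 1, so that P(sup_t (U_n(t) - t)/t > beta) = alpha."""
    return 1.0 / alpha - 1.0

alpha = 0.05
beta = linear_bounding_sequence(alpha)
print(beta, exceedance_frequency(1.0 + beta))
```

The printed frequency should be close to the nominal level, whatever the value of n, since the bounding sequence for the linear function does not depend on the number of observations.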

An estimate under a constant bounding function was already proposed by
Genovese and Wasserman [7]. Using the Dvoretzky-Kiefer-Wolfowitz (DKW)
inequality, a bounding sequence is given by $\beta_{n,\alpha} = \sqrt{\frac{1}{2n}\log\frac{2}{\alpha}}$. In contrast to the linear
bounding function, this bounding sequence vanishes for $n \to \infty$. However,
the estimate is unable to detect any proportion of false null hypotheses that is
of smaller order than $1/\sqrt{n}$. The intuitive reason is that the bounding function $\delta(t)$ is
not vanishing for small values of $t$. Any evidence from false null hypotheses, however
strong it may be, is hence lost if there are just a few false null hypotheses.
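As a small numerical illustration of the constant bounding function, the sketch below evaluates a DKW-based bounding sequence. It assumes the two-sided DKW inequality $P(\sup_t |U_n(t) - t| > \varepsilon) \le 2\exp(-2n\varepsilon^2)$; the function name is hypothetical.

```python
import math

def dkw_bounding_sequence(n, alpha):
    """Solve 2 * exp(-2 * n * eps^2) = alpha for eps (two-sided DKW bound)."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

for n in (100, 1000, 10000):
    # The sequence shrinks like n^(-1/2), unlike the linear bounding sequence.
    print(n, dkw_bounding_sequence(n, alpha=0.05))
```

Because the penalty is the same for every $t$, a handful of very small p-values cannot lift the empirical distribution above it, which is the loss of evidence described above.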

As already argued above, a bounding sequence for the standard deviation-
proportional bounding function is given by

(13)    $\beta_{n,\alpha} = a_n^{-1}\bigl(E^{-1}(1 - \alpha) + b_n\bigr),$

where $E$ is the Gumbel distribution and $a_n$, $b_n$ are defined as in Lemma 1. Note
that the bounding sequence is vanishing at almost the same rate as for the constant
bounding function. In contrast to the constant bounding function, however, the
standard deviation-proportional bounding function vanishes for small $t$. It will be
seen that the standard deviation-proportional bounding function possesses optimal
properties among a large class of possible bounding functions.
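The Gumbel-based bounding sequence can be sketched as follows. The normalizing constants $a_n$ and $b_n$ come from Lemma 1 and are deliberately left as caller-supplied arguments here; only the Gumbel quantile, which follows from $E(x) = \exp(-\exp(-x))$, is computed. Function names are illustrative.

```python
import math

def gumbel_quantile(p):
    """Inverse of the Gumbel distribution function E(x) = exp(-exp(-x))."""
    return -math.log(-math.log(p))

def asymptotic_bounding_sequence(a_n, b_n, alpha):
    """(E^{-1}(1 - alpha) + b_n) / a_n, with a_n and b_n taken from Lemma 1."""
    return (gumbel_quantile(1.0 - alpha) + b_n) / a_n

# The 95% Gumbel quantile that enters the sequence at level alpha = 0.05.
print(gumbel_quantile(0.95))
```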


2.3. Asymptotic properties of bounding sequences. Faced with an enormous 
number of potential bounding functions, it is of interest to look at general prop- 
erties of bounding functions, especially the asymptotic behavior of the resulting 
estimates. The asymptotic properties turn out to be mainly determined by the
behavior of $\delta(t)$ close to the origin.


DEFINITION 3. For every $\nu \in [0, 1]$, let $Q_\nu$ be a family of real-valued
functions on $[0, 1]$. In particular, $\delta(t) \in Q_\nu$ iff:
(a) $\delta(t)$ is nonnegative and finite on $[0, 1]$ and strictly positive on $(0, 1)$;
(b) $\delta(1 - t) \ge \delta(t)$ for $t \in (0, 1/2)$;
(c) the function $\delta(t)$ is regularly varying with power $\nu$, that is, for every $b > 0$,

$$\lim_{t \to 0} \frac{\delta(bt)}{\delta(t)} = b^{\nu}.$$


Most bounding functions of interest are members of $Q_\nu$ for some value of
$\nu \in [0, 1]$. The constant bounding function is a member of $Q_0$, while the linear
bounding function is a member of $Q_1$ and the standard deviation-proportional
bounding function is a member of $Q_{1/2}$.
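The regular-variation condition can be checked numerically for the three bounding functions named above. The sketch below evaluates the ratio $\delta(bt)/\delta(t)$ at a small $t$ and compares it with $b^\nu$; the dictionary and function names are illustrative choices.

```python
import math

# Each entry pairs a bounding function with its claimed power nu.
BOUNDING_FUNCTIONS = {
    "constant": (lambda t: 1.0, 0.0),
    "sd_proportional": (lambda t: math.sqrt(t * (1.0 - t)), 0.5),
    "linear": (lambda t: t, 1.0),
}

def variation_ratio(delta, b, t):
    """delta(b t) / delta(t), which should approach b**nu as t -> 0."""
    return delta(b * t) / delta(t)

for name, (delta, nu) in BOUNDING_FUNCTIONS.items():
    print(name, variation_ratio(delta, b=4.0, t=1e-8), 4.0 ** nu)
```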


PROPORTION OF FALSE NULL HYPOTHESES 379 


It holds in general for any bounding function that bounding sequences cannot
be of smaller order than the inverse square root of $n$. In particular, note that
by Definition 1 of a bounding sequence, it has to hold for any $t \in (0, 1)$ that
$P(U_n(t) - t - \beta_{n,\alpha}\delta(t) > 0) < \alpha$ for all $n \in \mathbb{N}$. Since $nU_n(t) \sim B(n, t)$ is
binomially distributed with mean $nt$ and variance proportional to $n$, it follows indeed
that

$$\liminf_{n \to \infty} n^{1/2}\beta_{n,\alpha} > 0.$$


Consider now bounding functions $\delta(t)$, which are members of $Q_\nu$ with some
$\nu \in (1/2, 1]$. It follows directly from Theorem 1.1(iii) in [4], page 255, that a more
restrictive assumption has to hold in this case, namely

(14)    $\liminf_{n \to \infty} n^{1-\nu}\beta_{n,\alpha} > 0.$


For $\nu = 1$ this amounts to $\liminf_{n \to \infty} \beta_{n,\alpha} > 0$. The linear bounding function is
a member of $Q_1$, explaining the lack of convergence to zero of the corresponding
optimal bounding sequence $1/\alpha - 1$.

For bounding functions $\delta(t) \in Q_\nu$ with $\nu \in [0, 1/2]$, there exists some constant
$c > 0$ so that $c\,\delta(t)^2 \ge t(1 - t)$. Hence, using Lemma 1, there exist bounding
sequences so that

(15)    $\limsup_{n \to \infty} \Bigl(\frac{n}{\log\log n}\Bigr)^{1/2} \beta_{n,\alpha} < \infty.$

The different asymptotic behavior of the bounding sequences influences the
asymptotic power to detect false null hypotheses, as will be seen subsequently.


3. Power. We examine the influence of the bounding function $\delta(t)$ on the
power to detect false null hypotheses. For simplicity of exposition, it is assumed
that the p-values of all false null hypotheses follow a common distribution $G$,
while p-values of true null hypotheses have a uniform distribution on $[0, 1]$. For
some $\gamma \in [0, 1)$, let

$$\lambda \sim n^{-\gamma}.$$

A value of $\gamma = 0$ corresponds to a fixed proportion of false null hypotheses, while
$\gamma = 1$ corresponds to a fixed absolute number of false null hypotheses. Here all
cases between those two extremes are considered.


Bounding sequences with vanishing level. For the asymptotic analysis, it is
convenient to let $\alpha = \alpha_n$ decrease monotonically for $n \to \infty$, so that $\alpha_n \to 0$ for
$n \to \infty$. Note that $\alpha_n \to 0$ is equivalent to $P(V_{n,\delta} > \beta_{n,\alpha_n}) \to 0$ for $n \to \infty$. For
notational simplicity, this assumption is strengthened slightly to

(16)    $V_{n,\delta}/\beta_{n,\alpha_n} \to_p 0, \qquad n \to \infty.$




In almost all cases of interest, (16) and $\alpha_n \to 0$ are equivalent. To maintain reasonable
power, one would like to avoid letting the level $\alpha_n$ vanish too fast as $n \to \infty$.
For bounding functions $\delta(t) \in Q_\nu$ with $\nu \in [0, 1/2]$ it is required that

(17)    $\limsup_{n \to \infty} \Bigl(\frac{n}{\log n}\Bigr)^{1/2} \beta_{n,\alpha_n} < \infty.$

It follows from (15) that it is always possible to find a sequence $\alpha_n \to 0$ so that
both (16) and (17) are satisfied. If both (16) and (17) are satisfied, the sequence
$\alpha_n$ is said to vanish slowly. For bounding functions $\delta(t) \in Q_\nu$ with $\nu \in (1/2, 1]$, it
will be seen below that the power is poor no matter how slowly the sequence $\alpha_n$
vanishes for $n \to \infty$.


3.1. Case I: many false null hypotheses, $\gamma \in [0, 1/2)$. The fluctuations in the
empirical distribution function are negligible compared to the signal from false
null hypotheses if $\gamma \in [0, 1/2)$. Hence one should be able to detect (asymptotically)
the full proportion of false null hypotheses in this first setting.

This is indeed achieved, as long as we look for bounding functions in $Q_\nu$ with
$\nu \in [0, 1/2]$, as shown below. If on the other hand $\nu \in (1/2, 1]$, one is in general unable
to detect the full proportion of false null hypotheses. The proportion of detected
false null hypotheses even converges in probability to zero for large values of $\gamma$
if $\nu$ is in the range $(1/2, 1]$. This includes in particular the linear FDR-style bounding
function $t \in Q_1$, which is only able to detect a nonvanishing proportion of
false null hypotheses (asymptotically) as long as the proportion $\lambda$ is bounded from
below, which is only satisfied for $\gamma = 0$.


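THEOREM 2. Let $G$ be continuous and let $\inf_{t \in (0,1)} G'(t) = 0$. Let $\hat\lambda$ be the
estimate under bounding function $\beta_{n,\alpha}\delta(t)$, where $\delta(t) \in Q_\nu$ with $\nu \in [0, 1]$ and
$\beta_{n,\alpha}$ is a bounding sequence. If $\nu \in [0, 1/2]$ and $\alpha_n$ vanishes slowly, then, for all
$\gamma \in [0, 1/2)$,

$$\frac{\hat\lambda}{\lambda} \to_p 1, \qquad n \to \infty.$$

However, for $\nu \in (1/2, 1]$ and $\gamma \in (1 - \nu, 1/2)$,

$$\frac{\hat\lambda}{\lambda} \to_p 0, \qquad n \to \infty.$$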

REMARK 2. The case $\inf_{t \in (0,1)} G'(t) = 0$ corresponds to the "pure" case
in [7]. If $\inf_{t \in (0,1)} G'(t) > 0$, the results above (and below) hold if $\lambda$ is replaced
by

$$\lambda\Bigl(1 - \inf_{t \in (0,1)} G'(t)\Bigr).$$




Without making parametric assumptions about the distribution $G$ under the
alternative, identifying this quantity is indeed the best one can hope for.


The message from Theorem 2 is that one should look for bounding functions in
$Q_\nu$ with $\nu \in [0, 1/2]$. This guarantees proper behavior of the estimate if the
proportion $\lambda$ of false null hypotheses is vanishing more slowly than the inverse square
root of the number of observations.


3.2. Case II: few false null hypotheses, $\gamma \in [1/2, 1)$. As seen above, bounding
functions in $Q_\nu$ with $\nu \le 1/2$ detect asymptotically the full proportion $\lambda$ of false
null hypotheses if $\lambda$ is vanishing more slowly than the inverse square root of the
number of observations.

For y > 1/2, no method can detect asymptotically the full proportion of false 
null hypotheses if the distribution under the alternative is fixed. For a fixed non- 
degenerate alternative, the majority of p-values from false null hypotheses fall 
with high probability into a fixed interval that is bounded away from zero. The 
fluctuations of the empirical distribution function in such an interval are asymptot- 
ically infinitely larger than any signal from false null hypotheses if y > 1/2, which 
makes detection of the full proportion of false null hypotheses impossible. 

It is hence interesting to consider cases where the signal from false null hypotheses
is increasing in strength. Therefore, let $G = G^{(n)}$, the distribution of p-values
under the alternative, be a function of the number $n$ of tests to conduct. The
superscript is dropped in the following for notational simplicity.


Shift-location testing. It is perhaps helpful to think about $G$ as being induced
by some shift-location testing problem. For each test it is assumed that there is a
test statistic $Z_i$, which follows some distribution $T_0$ under the null hypothesis $H_{0,i}$,
and some shifted distribution $T_{\mu_n}$ under the alternative $H_{1,i}$:

(18)    $H_{0,i}: Z_i \sim T_0, \qquad H_{1,i}: Z_i \sim T_{\mu_n}.$

In the Gaussian case this amounts, for example, to $T_0 = N(0, 1)$ and $T_{\mu_n} =
N(\mu_n, 1)$. To have an interesting problem, one needs for $\gamma \in (1/2, 1)$ in general
that the shift $\mu_n$ between the null and alternative hypotheses be increasing for an
increasing number of tests; that is, $\mu_n \to \infty$ for $n \to \infty$.

On the other hand, one would like to keep the problem subtle. For the Gaussian
case it was shown by Donoho and Jin [6] that an interesting scaling is given by
$\mu_n = \sqrt{2r\log n}$ with $r \in (0, 1)$. In this regime, the smallest p-value stems with
high probability from a true null hypothesis. The false null hypotheses have hence
little influence on the extremes of the distribution.




Instead of assuming Gaussianity of the test statistics, Donoho and Jin [6] considered
a variety of different distributions. Under a generalized Gaussian (Subbotin)
distribution, the density is for some positive value of $\kappa$ proportional to

$$T_\mu(x) \propto \exp\Bigl(-\frac{|x - \mu|^{\kappa}}{\kappa}\Bigr).$$

The case $\kappa = 2$ corresponds clearly to a Gaussian distribution; $\kappa = 1$ corresponds
to the double exponential case. The shift parameter is chosen then as

(19)    $\mu_n = (\kappa r \log n)^{1/\kappa}$





for some $r \in (0, 1)$. Note that the expectation of the smallest p-value from true null
hypotheses vanishes like $n^{-1}$, whereas under the scaling (19), the median p-value
of false null hypotheses vanishes like $n^{-r}$ for $n \to \infty$ with some $r \in (0, 1)$. In
fact, consider for any member of the generalized Gaussian Subbotin distribution
the $q$-quantile $G^{-1}(q)$ of the distribution of p-values under the alternative. For
some constant $c_q$, the $q$-quantile is proportional to

$$G^{-1}(q) \propto \int_{\mu_n + c_q}^{\infty} \exp\Bigl(-\frac{x^{\kappa}}{\kappa}\Bigr)\,dx.$$

Applying l'Hôpital's rule twice, it follows for any $c$ and $\kappa > 0$ that

$$\lim_{a \to \infty} \frac{\log \int_{a + c}^{\infty} \exp(-x^{\kappa}/\kappa)\,dx}{-a^{\kappa}/\kappa} = 1.$$

Thus, under the scaling (19), for every $q \in (0, 1)$ and positive $\kappa$, the scaling of
the $q$-quantile is given by

(20)    $\log G^{-1}(q) \sim -r \log n.$


With probability converging to 1 for n — oo, a p-value under a false null hypoth- 
esis is hence larger than the smallest p-value from all true null hypotheses as long 
as r € (0, 1). For r > 1, the problem gets trivial as the probability that an arbi- 
trarily high proportion of p-values under false null hypotheses is smaller than the 
smallest p-value from all true null hypotheses converges to 1 for n — oo. 

The point of introducing the shift-location model under generalized Gaussian 
Subbotin distributions was just to identify (20) with r € (0, 1) as the interesting 
scaling behavior of quantiles of G, the p-value distribution for alternative hypothe- 
ses. The setting (20) is potentially of interest beyond any shift-location model. We 
adopt the scaling (20) for the following discussion without making any explicit 
distributional assumptions about underlying test statistics. 
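The scaling of the quantiles can be illustrated numerically in the Gaussian case. The sketch below assumes one-sided p-values $1 - \Phi(Z)$, so that the median p-value under $N(\mu_n, 1)$ is $1 - \Phi(\mu_n)$; it takes $\log n$ as input so that very large $n$ can be explored. Function names are illustrative.

```python
import math

def shift_parameter(log_n, r, kappa=2.0):
    """mu_n = (kappa * r * log n)^(1/kappa)."""
    return (kappa * r * log_n) ** (1.0 / kappa)

def log_median_pvalue_gaussian(log_n, r):
    """log(1 - Phi(mu_n)): log of the median one-sided p-value under N(mu_n, 1)."""
    mu = shift_parameter(log_n, r, kappa=2.0)
    return math.log(0.5 * math.erfc(mu / math.sqrt(2.0)))

for exponent in (3, 6, 12, 100):
    log_n = exponent * math.log(10.0)  # n = 10**exponent
    # The ratio approaches -r, but only slowly in n.
    print(exponent, log_median_pvalue_gaussian(log_n, r=0.5) / log_n)
```

The slow approach of the ratio to $-r$ is a reminder that the relation is a statement about logarithmic order only.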


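THEOREM 3. Let $\lambda \sim n^{-\gamma}$ with $\gamma \in [1/2, 1)$ and let the distribution $G$ of
p-values under the alternative satisfy (20) for some $r \in (0, 1)$. Let $\hat\lambda$ be the
estimate of $\lambda$ under a bounding function $\beta_{n,\alpha}\delta(t)$, where $\delta(t) \in Q_\nu$ with $\nu \in [0, 1/2]$
and $\beta_{n,\alpha_n}$ is a bounding sequence for $\delta(t)$. Let $\alpha_n$ vanish slowly. If $r > \frac{1}{\nu}(\gamma - \frac{1}{2})$,
then

(21)    $\frac{\hat\lambda}{\lambda} \to_p 1.$

If, on the other hand, $r < \frac{1}{\nu}(\gamma - \frac{1}{2})$, then

(22)    $\frac{\hat\lambda}{\lambda} \to_p 0.$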


REMARK 3. The analysis was only carried out for functions with $\nu \in [0, 1/2]$
due to the deficits of the functions with $\nu \in (1/2, 1]$ discussed in the previous section.
Nevertheless, it would be possible to carry out the same analysis here. For $\nu = 1$,
one obtains, for example, a critical boundary $r > \gamma$.


The message from the last theorem is that among all bounding functions in $Q_\nu$
with $\nu \in [0, 1/2]$, it is best to choose a member of $Q_{1/2}$. Bounding functions in
$Q_{1/2}$ increase the chance to detect the full proportion $\lambda$ of false null hypotheses,
as illustrated for a few special cases in Figure 1. The area in the $(r, \gamma)$ plane where
$\hat\lambda/\lambda$ converges in probability to 1 for a bounding function in $Q_{1/2}$ includes in
particular all areas of convergence for bounding functions in $Q_\nu$ with $\nu \in [0, 1/2]$.


3.3. Connection to the familywise error rate. A different estimate of $\lambda$ is obtained
by controlling the familywise error rate (FWER). In particular, let the estimate
be the total number of p-values less than the FWER threshold $\alpha/n$, divided
by the total number of hypotheses,

$$\hat\lambda = \frac{1}{n}\sum_{i=1}^{n} 1\{P_i < \alpha/n\}.$$

This is an estimate of $\lambda$ with the desired property $P(\hat\lambda > \lambda) < \alpha$. Controlling the
familywise error rate has often been criticized for lack of power. Indeed, in the 


FIG. 1. For $\nu = 0$ (left), $\nu = 1/2$ (second from left) and $\nu = 1$ (second from right), an illustration
of the asymptotic properties of the estimate $\hat\lambda$. The shaded area marks those areas in the $(r, \gamma)$ plane
where $\hat\lambda/\lambda \to_p 1$, whereas for the white areas $\hat\lambda/\lambda \to_p 0$. The choice $\nu = 1/2$ is seen to be optimal.
The corresponding plot for control of the familywise error rate is shown on the right for comparison.




asymptotic analysis above it is straightforward to show that the area in the $(r, \gamma)$
plane where $\hat\lambda/\lambda \to_p 1$ is restricted to the half-plane $r > 1$ (neglecting again what
happens directly on the border $r = 1$). In comparison to other estimates proposed
here, the familywise error rate is hence particularly bad for estimating $\lambda$ if there
are many false null hypotheses, each with a very weak signal. In addition, the con-
struct requires that p-values can be determined accurately down to precision a/n, 
which might be prohibitively small. In contrast, the performance of estimates of 
the form (5) does not deteriorate significantly if p-values are truncated at larger 
values. 

The drawbacks of the familywise error rate are a consequence of the stricter 
inference one is trying to make when controlling the familywise error rate. In par- 
ticular, one is trying to infer exactly which hypotheses are false nulls as opposed 
to only how many false nulls there are in total. The loss in power is hence the price 
one pays for this more ambitious goal. 
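The FWER-based estimate described in this section is simple enough to state in a few lines. The sketch below is one reading of that description (a Bonferroni threshold $\alpha/n$ with a strict inequality); the function name and the example p-values are made up for illustration.

```python
def fwer_estimate(p_values, alpha):
    """Fraction of p-values strictly below the Bonferroni threshold alpha / n."""
    n = len(p_values)
    threshold = alpha / n
    return sum(p < threshold for p in p_values) / n

# Ten hypothetical p-values; the threshold at alpha = 0.05 is 0.005.
example = [1e-6, 2e-4, 0.003, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99, 0.4]
print(fwer_estimate(example, alpha=0.05))
```

Note that the threshold shrinks like $1/n$, which is why the estimate needs p-values that are accurate far into the tail.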


3.4. Connection to higher criticism. A connection of the proposed estimate
to the higher criticism method of Donoho and Jin [6] for detection of sparse
heterogeneous mixtures emerges. In their setup p-values $P_i$, $i = 1, \ldots, n$, are i.i.d.
according to a mixture distribution

$$P_i \sim (1 - \tilde\lambda)H + \tilde\lambda G,$$

where $H$ is the uniform distribution and $G$ the distribution of p-values under the
alternative hypothesis. In [6] the focus is on testing the global null hypothesis that
there are no false null hypotheses at all,

$$H_0: \tilde\lambda = 0.$$

In contrast, in this current paper we are interested in quantifying the proportion $\lambda$
of false null hypotheses. The proportion $\lambda$ of false null hypotheses, as defined for
the current paper in (1), can be viewed as a realization of a random variable with
a binomial distribution, $n\lambda \sim B(n, \tilde\lambda)$. For the asymptotic considerations of this
paper, however, the distinction between $\tilde\lambda$ and $\lambda$ is of little importance because the
ratio $\lambda/\tilde\lambda$ converges almost surely to 1 for $n \to \infty$.

The two goals of higher criticism and the current paper are connected. If there
is evidence for a positive proportion of false null hypotheses with the proposed
method, then the global null $H_0$ can clearly be rejected. In other words, if one obtains
a positive estimate $\hat\lambda > 0$ with $P(\hat\lambda > \lambda) < \alpha$, then the global null hypothesis
$H_0: \tilde\lambda = 0$ can be rejected at level $\alpha$. Note that the level is correct even for finite
samples and not just asymptotically.

The connection between the two methods works as well in the reverse direction
if an optimal bounding function is chosen. It emerged in particular from the
analysis above that bounding functions that are members of $Q_{1/2}$ have optimal
asymptotic properties. For the particular choice of a standard deviation-proportional
bounding function in $Q_{1/2}$, let $\hat\lambda$ be an estimate of $\lambda$ and let $\beta_{n,\alpha}$ be a bounding
sequence that satisfies

$$\beta_{n,\alpha} = n^{-1/2}(2\log\log n)^{1/2}(1 + o(1)).$$


Donoho and Jin [6] are not specific about the choice of a critical value for higher
criticism. However, choosing $\sqrt{n}\,\beta_{n,\alpha}$ as a critical value meets their requirements.
The higher criticism procedure rejects in this case if and only if the estimate $\hat\lambda$ of
the proportion of false null hypotheses is positive,

$$\{\text{reject } H_0: \tilde\lambda = 0 \text{ with higher criticism}\} = \{\hat\lambda > 0\}.$$
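The equivalence between a higher criticism rejection and a positive estimate can be made concrete. The sketch below assumes the estimate has the form $\sup_t (F_n(t) - t - \beta\delta(t))/(1 - t)$ truncated at zero, evaluates all suprema on the observed p-values only, and takes the higher criticism statistic to be $\sqrt{n}$ times the weighted supremum; these are simplifying assumptions and the names are illustrative.

```python
import math

def sd_delta(t):
    return math.sqrt(t * (1.0 - t))

def weighted_sup(p_values):
    """max over observed p-values of (F_n(p) - p) / sqrt(p (1 - p))."""
    p = sorted(p_values)
    n = len(p)
    return max(((i + 1) / n - p_i) / sd_delta(p_i) for i, p_i in enumerate(p))

def higher_criticism_statistic(p_values):
    return math.sqrt(len(p_values)) * weighted_sup(p_values)

def estimate(p_values, beta):
    """max over observed p-values of (F_n(p) - p - beta*delta(p)) / (1 - p), floored at 0."""
    p = sorted(p_values)
    n = len(p)
    best = max(((i + 1) / n - p_i - beta * sd_delta(p_i)) / (1.0 - p_i)
               for i, p_i in enumerate(p))
    return max(0.0, best)

p_values = [0.001, 0.004, 0.02, 0.3, 0.45, 0.6, 0.75, 0.9]  # made-up example
for beta in (1.0, 5.0):
    rejects = higher_criticism_statistic(p_values) > math.sqrt(len(p_values)) * beta
    print(beta, rejects, estimate(p_values, beta) > 0.0)
```

Both indicators agree because the numerator $F_n(t) - t - \beta\delta(t)$ is positive at some $t$ exactly when the weighted supremum exceeds $\beta$.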


If both $\lambda \sim n^{-\gamma}$ and $\tilde\lambda \sim n^{-\gamma}$ for some $\gamma \in [0, 1]$, the question arises if the area in
the $(\gamma, r)$ plane where

(23)    $P(\text{higher criticism rejects } H_0) \to 1$

is identical to the area where

(24)    $\frac{\hat\lambda}{\lambda} \to_p 1.$


Intuitively, it is clear that it is somewhat easier to test for the global null
hypothesis $H_0: \tilde\lambda = 0$, as done in higher criticism, than to estimate the precise proportion
$\lambda$ of false null hypotheses, as done in this paper. One would therefore expect that
the area of convergence in the $(\gamma, r)$ plane of (23) includes the area of convergence
of (24).

It is hence maybe surprising that for some cases the areas of convergence in
the $(\gamma, r)$ plane of (23) and (24) agree. To illustrate the point, consider again the
shift-location model (18) under a generalized Gaussian Subbotin distribution with
parameter $\kappa \in (0, 2)$ and a shift (19) of test statistics under the alternative.

The area in the $(\gamma, r)$ plane where $\hat\lambda/\lambda \to_p 1$ is in this setting independent of the
parameter $\kappa$. The detection boundary for higher criticism, however, does depend
on $\kappa$. For the Gaussian case ($\kappa = 2$) and in general for $\kappa > 1$, the detection boundary
for higher criticism is, for $\gamma \in (1/2, 1)$, below the area where $\hat\lambda/\lambda \to_p 1$. The
reason for this is intuitively clear. The higher criticism method looks in these cases
for evidence against $H_0$ in the extreme tails of the distribution $G$; see [6]. At these
points, only a vanishing proportion of all p-values from false null hypotheses can
be found. If one is trying to estimate the full proportion of false null hypotheses,
the evidence for a certain amount of false null hypotheses has to be found at less
extreme points, where one can expect a significant proportion of p-values from
false null hypotheses. This limits the region of convergence in the sense of (24)
compared to the area where higher criticism can successfully reject the global null
hypothesis $H_0: \tilde\lambda = 0$.

However, for $\kappa \le 1$ (including thus the case of a double-exponential distribution)
the two areas where (23) and (24) hold, respectively, are identical, as shown







FIG. 2. Comparison between the estimate of $\lambda$ and detection regions under higher criticism if
test statistics follow the location-shift model (18) and are distributed according to the generalized
Gaussian Subbotin distribution with shift parameter (19). The shaded area in the left panel shows
again the area of convergence in probability of $\hat\lambda/\lambda$ to 1 for a bounding function in the class $Q_{1/2}$.
The shaded area in the right panel corresponds to the region where higher criticism can reject
asymptotically the null hypothesis $H_0: \tilde\lambda = 0$ for $\kappa \le 1$, including the double-exponential case. The line
below marks the detection boundary for the Gaussian case ($\kappa = 2$).


in Figure 2. In the white area, both higher criticism and the current method fail
to detect (asymptotically) the presence of false null hypotheses, and not even the
likelihood ratio test is able to reject in these cases (asymptotically) the global null
hypothesis $H_0: \tilde\lambda = 0$ that there are only true null hypotheses [6]. It is hence of
interest to see that for $\kappa \le 1$, $\hat\lambda/\lambda \to_p 1$ holds whenever the likelihood ratio test
succeeds (asymptotically) in rejecting the global null hypothesis.


4. Numerical examples. It emerged from the analysis above that the standard
deviation-proportional bounding function is optimal in an asymptotic sense. In the
following discussion we briefly compare various bounding functions for a moderate
number of tests, $n = 1000$. The setup is identical to the shift-location testing
of Section 3.2, equation (18). For true null hypotheses, test statistics follow the
normal distribution $N(0, 1)$. For false null hypotheses, test statistics are shifted by
an amount $\mu > 0$ and are $N(\mu, 1)$-distributed.

The proportion $\hat\lambda/\lambda$ of correctly identified false null hypotheses is computed for
various values of the shift parameter $\mu$, and three bounding functions. The results
for 100 simulations are shown in Figure 3. The left column shows results for very
few false null hypotheses ($\lambda = 0.01$), corresponding to 10 false null hypotheses,
while results are shown in the right column for a moderately large number of false
null hypotheses ($\lambda = 0.2$).

For very few false null hypotheses ($\lambda = 0.01$), both the standard deviation-
proportional and linear bounding functions identify a substantial proportion of
false null hypotheses if the shift $\mu$ is larger than about 3. The expected value of
the largest test statistic from true null hypotheses is, for comparison in the current




FIG. 3. The proportion $\hat\lambda/\lambda$ of correctly detected false null hypotheses as a function of the
separation $\mu$. Results are shown for the standard deviation-proportional bounding function (top row), the
constant bounding function (middle row), and the linear bounding function (bottom row).


setup, at around 3.7. The constant bounding function ($\nu = 0$) fails to identify any
of the 10 false null hypotheses even for very large shifts $\mu$. This is in line with the
theoretical results from Section 3.2. For a moderately large number of false null
hypotheses ($\lambda = 0.2$), the performance of the linear bounding function is worse
than for the other two bounding functions, as expected from the asymptotic results
in Section 3.1. The standard deviation-proportional bounding function ($\nu = 1/2$)
in both cases consistently identifies the most false null hypotheses, and the
optimality of this bounding function is thus numerically evident for moderate sample
sizes as well.
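A simulation in the spirit of this comparison can be sketched as follows. The code assumes the estimate has the form $\sup_t (F_n(t) - t - \beta\delta(t))/(1 - t)$ truncated at zero and evaluated on the observed p-values, uses one-sided p-values $1 - \Phi(Z)$, and covers only the linear and constant bounding functions, whose bounding sequences have closed forms ($1/\alpha - 1$ and a two-sided DKW bound). The seed, the shift of 5 and all names are arbitrary illustrative choices, not the settings behind Figure 3.

```python
import math
import random

def estimate_proportion(p_values, delta, beta):
    """max over observed p-values of (F_n(t) - t - beta*delta(t)) / (1 - t), floored at 0."""
    best = 0.0
    n = len(p_values)
    for i, t in enumerate(sorted(p_values), start=1):
        if 0.0 < t < 1.0:  # guard against p-values that round to 0 or 1
            best = max(best, (i / n - t - beta * delta(t)) / (1.0 - t))
    return best

def simulate_p_values(n, n_false, mu, rng):
    """One-sided p-values 1 - Phi(z); the first n_false statistics are shifted by mu."""
    z = [rng.gauss(mu if i < n_false else 0.0, 1.0) for i in range(n)]
    return [0.5 * math.erfc(z_i / math.sqrt(2.0)) for z_i in z]

n, alpha = 1000, 0.05
rng = random.Random(1)
p = simulate_p_values(n, n_false=200, mu=5.0, rng=rng)  # true proportion 0.2
linear = estimate_proportion(p, lambda t: t, 1.0 / alpha - 1.0)
constant = estimate_proportion(p, lambda t: 1.0,
                               math.sqrt(math.log(2.0 / alpha) / (2.0 * n)))
print(linear, constant)
```

The standard deviation-proportional bounding function can be plugged into the same routine once a bounding sequence for it is available, either from (13) or from simulation.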

For the standard deviation-proportional bounding function ($\nu = 1/2$), asymptotic
control was proposed in (10). The result relies on convergence of the
supremum of a weighted empirical distribution to the Gumbel distribution. This
convergence is in general slow, as already mentioned in Remark 1. The convergence
is comparably fast, however, if the region over which the supremum is taken
is restricted to, say, $(1/n, 1 - 1/n)$, as observed by Donoho and Jin [6]. We illustrate
this in the following text. Restricting the interval over which the supremum
is taken in (5) to some interval $(a, b)$ with $0 < a < b < 1$, bounding sequences can
be defined analogous to Definition 1 by requirement (b) in Definition 1 and

(25)    $\beta_{n,\alpha} = \min\Bigl\{\beta : P\Bigl(\sup_{t \in (a,b)} \frac{U_n(t) - t}{\delta(t)} > \beta\Bigr) < \alpha\Bigr\}.$


Bounding sequences for the interval (0, 1) satisfy (25) for every interval (a, b), but 
might be unduly conservative. Less conservative bounding sequences can be found 




FIG. 4. Random samples of the weighted empirical distribution function $(U_n(t) - t)/\delta(t)$ with
$\delta(t) = \sqrt{t(1 - t)}$ on the left. Various bounding sequences $\beta_{n,\alpha}$ as a function of $n$ in log-log scale on
the right: the asymptotically valid bounding sequence (solid line), and the bounding sequences for
the intervals $(0, 1)$ (dotted line), $(1/n, 1 - 1/n)$ (upper dashed line) and $(1/n, 0.01)$ (lower dashed
line), as obtained by simulation. Note that the latter two are almost indistinguishable.


conveniently by approximating the probability of $\sup_{t \in (a,b)} (U_n(t) - t)/\delta(t) > \beta$
with the empirical proportion of occurrence of this event among a large number of
simulations. This is illustrated in the left panel in Figure 4. Shown are five random
samples of the weighted empirical distribution $(U_n(t) - t)/\delta(t)$ for $n = 200$
and $\delta(t) = \sqrt{t(1 - t)}$. Let the value $\beta$ correspond to the lower bound of the gray
area in Figure 4. For an interval $(a, b) = (0, 0.4)$, the event $\sup_{t \in (a,b)} (U_n(t) -
t)/\delta(t) > \beta$ corresponds then to the event that a realization of a weighted empirical
distribution crosses the gray area. The bounding sequences obtained by using 1000
simulations of the weighted empirical distribution are shown in the right panel in
Figure 4 for various intervals $(a, b)$.
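The simulation approach to bounding sequences can be sketched as below for the standard deviation-proportional bounding function. The code relies on the observation that $(c - t)/\sqrt{t(1 - t)}$ is decreasing in $t$ for fixed $c \in [0, 1]$, so the supremum over $(a, b)$ is reached either in the limit at the left endpoint or at an order statistic inside the interval. It requires $a > 0$, returns an empirical $(1 - \alpha)$ quantile as an approximation to the minimum in (25), and uses hypothetical function names.

```python
import bisect
import math
import random

def sd_delta(t):
    return math.sqrt(t * (1.0 - t))

def weighted_sup(u_sorted, a, b):
    """sup over t in (a, b) of (U_n(t) - t) / sqrt(t (1 - t)); requires 0 < a < b < 1."""
    n = len(u_sorted)
    k = bisect.bisect_right(u_sorted, a)      # number of draws <= a
    best = (k / n - a) / sd_delta(a)          # limit at the left endpoint
    for i in range(k, n):
        t = u_sorted[i]
        if t >= b:
            break
        best = max(best, ((i + 1) / n - t) / sd_delta(t))
    return best

def simulated_bounding_sequence(n, alpha, a, b, reps=1000, seed=0):
    """Empirical (1 - alpha) quantile of the weighted supremum over (a, b)."""
    rng = random.Random(seed)
    sups = sorted(
        weighted_sup(sorted(rng.random() for _ in range(n)), a, b)
        for _ in range(reps)
    )
    return sups[math.ceil((1.0 - alpha) * reps) - 1]

n = 200
print(simulated_bounding_sequence(n, 0.05, 1.0 / n, 1.0 - 1.0 / n))
print(simulated_bounding_sequence(n, 0.05, 1.0 / n, 0.5))
```

With a common seed the same uniform samples are reused, so the value for a sub-interval sharing the left endpoint can never exceed the value for the larger interval.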

There are two main conclusions. First, one might suspect that p-values from 
false null hypotheses are mostly found in a neighborhood around zero. Restricting 
the region in (5) to such a neighborhood promises thus to capture all p-values from 
false null hypotheses while allowing for smaller bounding sequences. However, 
the numerical results suggest otherwise. The bounding sequence for the region 
(1/n, 0.01) is, for example, almost indistinguishable from the bounding sequence 
for the region (1/n, 1 — 1/n), as can be seen in Figure 4. 

Second, the agreement of the asymptotically valid bounding sequence (13) with 
the bounding sequence that is obtained by simulation for the interval (1/n, 1 — 
1/n) is very good even for moderate sample sizes, while the agreement is not so 
good for the interval (0, 1). When using the asymptotically valid bounding se- 
quence it is hence advisable to restrict the region over which the supremum is 
taken in (5) to $(1/n, 1 - 1/n)$. This ensures that the true level is close to the chosen
level $\alpha$ for moderate sample sizes.




For practical applications, we hence recommend that one calculate the supre- 
mum in (5) over a region (1/n,1 — 1/n) and use the standard deviation- 
proportional bounding function with the asymptotically valid bounding se- 
quence (13). The asymptotic results of the previous sections hold for this modified 
procedure. 


5. Proofs. 


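PROOF OF THEOREM 2. First it is shown that, as long as $\gamma \in [0, 1/2)$ and $\nu \le 1/2$,
for any given $\varepsilon > 0$,

(26)    $P(\hat\lambda < (1 - \varepsilon)\lambda) \to 0, \qquad n \to \infty.$

Let the empirical distribution of p-values be defined as in (4) by $F_n(t) =
n^{-1}\sum_{i=1}^{n} 1\{P_i \le t\}$. We suppose that the proportion of false null hypotheses is
fixed at $\lambda$, so that $F_n(t)$ is a mixture $F_n(t) = \lambda G_{n_1}(t) + (1 - \lambda)U_{n_0}(t)$, where
$G_{n_1}(t)$ is the empirical distribution of $n_1 = \lambda n$ i.i.d. p-values with distribution
$G$ and $U_{n_0}(t)$ is the empirical distribution of $n_0 = (1 - \lambda)n$ i.i.d. p-values with
uniform distribution $U$. Write $F(t) = \lambda G(t) + (1 - \lambda)t$ for the expectation of $F_n(t)$.
For any $t < 1$,

$$
\begin{aligned}
\hat\lambda &= \sup_{t \in (0,1)} \frac{F_n(t) - t - \beta_{n,\alpha_n}\delta(t)}{1 - t} &\quad (27)\\
&\ge \frac{F(t) - t}{1 - t} - \frac{\beta_{n,\alpha_n}\delta(t) + F(t) - F_n(t)}{1 - t} &\quad (28)\\
&= \lambda\,\frac{G(t) - t}{1 - t} - \frac{\beta_{n,\alpha_n}\delta(t) + F(t) - F_n(t)}{1 - t}. &\quad (29)
\end{aligned}
$$

Since $\inf_{t \in (0,1)} G'(t) = 0$ and, hence, $\sup_{t \in (0,1)} (G(t) - t)/(1 - t) = 1$, there
exists by continuity of $G(t)$ some $t_1$ so that $(G(t_1) - t_1)/(1 - t_1) > (1 - \varepsilon/2)$.
Setting $\varepsilon' = \frac{1}{2}(1 - t_1)\varepsilon$, it suffices to show that for every $\varepsilon' > 0$,

$$P\bigl(\beta_{n,\alpha_n}\delta(t_1) + F(t_1) - F_n(t_1) > \varepsilon'\lambda\bigr) \to 0, \qquad n \to \infty.$$

Since $F_n(t_1) - F(t_1) = O_p(n^{-1/2})$ and $\lambda \sim n^{-\gamma}$ with $\gamma < 1/2$, this follows from
the finiteness of $\delta(t)$ and, because $\alpha_n$ vanishes slowly, from (17). This completes
the first part of the proof of Theorem 2.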

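For the second part, it suffices to show that for $\nu \in (1/2, 1]$ and $\gamma \in (1 - \nu, 1/2)$,
and any $\varepsilon > 0$,

(30)    $P(\hat\lambda > \varepsilon\lambda) \to 0, \qquad n \to \infty.$

In this regime, the penalty $\beta_{n,\alpha_n}\delta(t)$ is asymptotically larger than the signal from
false null hypotheses. Using the definition of $\hat\lambda$, the notation $n_0 = (1 - \lambda)n$ and
$n_1 = \lambda n$, and $F_n(t) = \lambda G_{n_1}(t) + (1 - \lambda)U_{n_0}(t)$, it follows that

$$
\begin{aligned}
P(\hat\lambda > \varepsilon\lambda)
&= P\Bigl(\sup_{t \in (0,1)} \frac{F_n(t) - t - \beta_{n,\alpha_n}\delta(t)}{1 - t} - \varepsilon\lambda > 0\Bigr)\\
&= P\Bigl(\sup_{t \in (0,1)} \Bigl[\lambda\frac{G_{n_1}(t) - t}{1 - t} - \varepsilon\lambda + (1 - \lambda)\frac{U_{n_0}(t) - t}{1 - t} - \beta_{n,\alpha_n}\frac{\delta(t)}{1 - t}\Bigr] > 0\Bigr)\\
&\le P\Bigl(\sup_{t \in (0,1)} \Bigl[\lambda\frac{G_{n_1}(t) - t}{1 - t} - \varepsilon\lambda - \frac{\beta_{n,\alpha_n}}{2}\frac{\delta(t)}{1 - t}\Bigr] > 0\Bigr) &\quad (31)\\
&\quad + P\Bigl(\sup_{t \in (0,1)} \Bigl[(1 - \lambda)\frac{U_{n_0}(t) - t}{1 - t} - \frac{\beta_{n,\alpha_n}}{2}\frac{\delta(t)}{1 - t}\Bigr] > 0\Bigr). &\quad (32)
\end{aligned}
$$

Observe in (32) that $(1 - \lambda)^{-1}\beta_{n,\alpha_n} = n\beta_{n,\alpha_n}/n_0 \ge \beta_{n_0,\alpha_n} \ge \beta_{n_0,\alpha_{n_0}}$. Thus (32)
can be bounded by $P(V_{n_0,\delta} > \beta_{n_0,\alpha_{n_0}}/2)$. By (16) and $n_0 \to \infty$ it follows that (32)
vanishes for $n \to \infty$. It remains to show that (31) vanishes as well. Let $t_2 = \sup\{t \in
(0, 1): G(t) \le \varepsilon/2\}$. Using Bonferroni's inequality, (31) is bounded by

$$
\begin{aligned}
&P\Bigl(\sup_{t \in (0,t_2]} \Bigl[\lambda\frac{G_{n_1}(t) - t}{1 - t} - \varepsilon\lambda\Bigr] > 0\Bigr) &\quad (33)\\
&\quad + P\Bigl(\sup_{t \in (t_2,1)} \Bigl[\lambda\frac{G_{n_1}(t) - t}{1 - t} - \frac{\beta_{n,\alpha_n}}{2}\frac{\delta(t)}{1 - t}\Bigr] > 0\Bigr). &\quad (34)
\end{aligned}
$$

Since $(G_{n_1}(t) - t)/(1 - t) \le G_{n_1}(t)$ for all $t \in [0, 1]$, the first term (33) is
bounded by $P(G_{n_1}(t_2) > \varepsilon)$, which vanishes for $n \to \infty$ because by definition
of $t_2$, $G(t_2) \le \varepsilon/2$ and $n_1 = \lambda n \to \infty$. Using $G_{n_1}(t) \le 1$, the second term (34)
equals zero if $\beta_{n,\alpha_n} \inf_{t \in (t_2,1)} \delta(t)/(1 - t) > 2\lambda$. By conditions (a) and (b) in
Definition 3, it holds that $\inf_{t \in (t_2,1)} \delta(t)/(1 - t) > 0$. By (14), it follows furthermore
that $\beta_{n,\alpha_n}/\lambda \to \infty$ for $n \to \infty$, which completes the proof. □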


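PROOF OF THEOREM 3. First it is shown that for $r > \frac{1}{\nu}(\gamma - \frac{1}{2})$,

(35)    $P(\hat\lambda < (1 - \varepsilon)\lambda) \to 0, \qquad n \to \infty.$

Here the penalty is again asymptotically larger than the signal from false null
hypotheses for a fixed point $t \in (0, 1)$. However, because the signal from false null
hypotheses is increasing in strength for larger $n$, the evidence for a certain amount
of false null hypotheses can be found at decreasing values of $t$. Using the definition
of $\hat\lambda$, for any $t \in (0, 1)$, $\hat\lambda \ge F_n(t) - t - \beta_{n,\alpha_n}\delta(t)$ and, hence, for any
$t \in (0, 1)$,

$$\frac{\hat\lambda}{\lambda} - 1 \ge -(1 - G_{n_1}(t)) - t - \frac{1 - \lambda}{\lambda}\bigl(t - U_{n_0}(t)\bigr) - \frac{1}{\lambda}\beta_{n,\alpha_n}\delta(t),$$

where again $n_1 = \lambda n$ and $n_0 = (1 - \lambda)n$. Choosing $t_{n,\tau} = n^{-r+\tau}$ for some $0 <
\tau < r - \frac{1}{\nu}(\gamma - \frac{1}{2})$, observe that by (20) it follows that $1 - G(n^{-r+\tau}) = o(1)$.
Hence

$$
\begin{aligned}
\frac{\hat\lambda}{\lambda} - 1 &\ge -(1 - G(t_{n,\tau})) - |G(t_{n,\tau}) - G_{n_1}(t_{n,\tau})|\\
&\quad - t_{n,\tau} - \frac{1 - \lambda}{\lambda}|t_{n,\tau} - U_{n_0}(t_{n,\tau})| - \frac{1}{\lambda}\beta_{n,\alpha_n}\delta(t_{n,\tau})\\
&= o(1) - o_p(1) - o(1) - O_p\bigl(n^{\gamma - 1/2 - (r - \tau)/2}\bigr)\\
&\quad - O\bigl(n^{\gamma - 1/2 - \nu(r - \tau)}\log n\bigr).
\end{aligned}
$$

The proof of (35) follows because $\gamma < \frac{1}{2} + \nu(r - \tau) \le \frac{1}{2} + \frac{1}{2}(r - \tau)$.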

Second, it has to be shown that $P(\hat\lambda > \varepsilon\lambda) \to 0$ if $r < \frac{1}{\nu}(\gamma - \frac{1}{2})$. Again, the
evidence for a certain amount of false null hypotheses would have to be found
at decreasing values of $t$. However, the decrease has to be so fast in this regime
that the signal from false null hypotheses is not captured. Using again the notation
$n_1 = \lambda n$ and $n_0 = (1 - \lambda)n$, we find that $\hat\lambda = \sup_{t \in (0,1)} D_{n,\lambda}(t)$, where

(36)    $D_{n,\lambda}(t) := \frac{\lambda(G_{n_1}(t) - t) + (1 - \lambda)(U_{n_0}(t) - t) - \beta_{n,\alpha_n}\delta(t)}{1 - t}.$

Choose a sequence $t_{n,\rho} = n^{-r-\rho}$ for some $0 < \rho < \frac{1}{\nu}(\gamma - \frac{1}{2}) - r$. The regions
$(0, t_{n,\rho}]$ and $(t_{n,\rho}, 1)$ are considered separately in the following. In particular, it
is shown that both $P(\sup_{t \in (0,t_{n,\rho}]} D_{n,\lambda}(t) > \varepsilon\lambda)$ and $P(\sup_{t \in (t_{n,\rho},1)} D_{n,\lambda}(t) > \varepsilon\lambda)$
vanish for $n \to \infty$. For $t > t_{n,\rho}$, it holds that

P( sup Dna) > 2 


f €(fn, p, 1) 





Un (t) — t S(t 
< P( sup ipee ee g >0) 
tE(tn.p,1) Ies beu 
Una (t) —t Ô 
< P( sup (73) 78 _ Pras IO 0) 
tE (tn 5,1) |—t 2 i-t 
a. Ot 
+1 sup p Pu T so] 
t€(tn.p+1) 2d; 
n Pn,a, 
(37) = P( sup (Us, t) = #) - — Poeta) > 0) 
tEn o, 1) no 2 


(38) Prin SW <i}. 


|. inf 2 l-t 


By (16) and because nf, 4, is monotonically increasing, (37) vanishes for n — oo. 
For (38), because à € Q,, there exists a constant c so that infre, ,,1 9 (£55) = 




cn € *?. Tt follows byr+p< L(y — 3) that infre(t, ,.1) Bn,a,9 (£n, 5)/4 — oo 
for n — oo, which completes the first part of the proof. 

It remains to show that P (sup e(0,t | Dai (f) > £X) > 0 for n — oo. It holds 
that | 


P( sup Dn y(t) > er) 


fE(0,fn | 
(39) < P( sup (1— Un 0 7t — Big > =a) 
t€(0,5,5] Deu 1—t 3 
t) — G(t 
(40) T P( sup 10-0 > =] 
t€(0,1,. 5] Ls 3 
G(t) —t 
(41) + 1| sup jT > 2j. 
telO] 17t 3 


As already argued above, the probability on the right-hand side of (39) van- 
ishes for n — oo. The probability (40) clearly likewise vanishes and it remains 
to show that (41) vanishes as well for n — oo. Whereas t4,» — O0, it holds that 
(1—4)! <2 for t e (0, fn 5] and large enough values of n. The term (41) van- 
ishes hence if G (tn,p) < e: This is equivalent to log Gu} (&) « —(t + p)logn, and 
the claim follows from property (20). | 


Acknowledgments. The authors would like to thank an anonymous referee, the Associate Editor and Jianqing Fan for helpful comments, which helped to improve an earlier version of the paper. Nicolai Meinshausen would also like to thank Peter Bühlmann for interesting discussions.


REFERENCES


[1] BENJAMINI, Y. and HOCHBERG, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300. MR1325392

[2] BENJAMINI, Y. and HOCHBERG, Y. (2000). On the adaptive control of the false discovery rate in multiple hypothesis testing with independent statistics. J. Educational and Behavioral Statistics 25 60-83.

[3] CHEN, W., ZHANG, Z., KING, S., ALCOCK, C., BYUN, Y., COOK, K., DAVE, R., GIAMMARCO, J., LEE, T., LEHNER, M., LIANG, C., LISSAUER, J., MARSHALL, S., DE PATER, I., PORRATA, R., RICE, J., WANG, A., WANG, S. and WEN, C. (2003). Fast CCD photometry in the Taiwanese-American Occultation Survey. Baltic Astronomy 12 568-573.

[4] CSÖRGŐ, M. and HORVÁTH, L. (1993). Weighted Approximations in Probability and Statistics. Wiley, Chichester. MR1215046

[5] DANIELS, H. (1945). The statistical theory of the strength of bundles of threads. I. Proc. Roy. Soc. London Ser. A 183 405-435. MR0012388

[6] DONOHO, D. and JIN, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Statist. 32 962-994. MR2065195

[7] GENOVESE, C. and WASSERMAN, L. (2004). A stochastic process approach to false discovery control. Ann. Statist. 32 1035-1061. MR2065197

[8] LIANG, C.-L., RICE, J., DE PATER, I., ALCOCK, C., AXELROD, T., WANG, A. and MARSHALL, S. (2004). Statistical methods for detecting stellar occultations by Kuiper Belt objects: The Taiwanese-American Occultation Survey. Statist. Sci. 19 265-274. MR2146947

[9] MEINSHAUSEN, N. and BÜHLMANN, P. (2005). Lower bounds for the number of false null hypotheses for multiple testing of associations under general dependence structures. Biometrika 92 893-907.

[10] NETTLETON, D. and HWANG, J. (2003). Estimating the number of false null hypotheses when conducting many tests. Technical report, Dept. Statistics, Iowa State Univ.

[11] SCHWEDER, T. and SPJØTVOLL, E. (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika 69 493-502.

[12] SHORACK, G. and WELLNER, J. (1986). Empirical Processes with Applications to Statistics. Wiley, New York. MR0838963

[13] STOREY, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479-498. MR1924302


SEMINAR FÜR STATISTIK
ETH ZÜRICH
8092 ZÜRICH
SWITZERLAND
E-MAIL: nicolai@stat.math.ethz.ch

DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA
BERKELEY, CALIFORNIA 94720
USA
E-MAIL: rice@stat.berkeley.edu


The Annals of Statistics
2006, Vol. 34, No. 1, 394-415
DOI 10.1214/009053605000000778
© Institute of Mathematical Statistics, 2006


FALSE DISCOVERY AND FALSE NONDISCOVERY RATES IN
SINGLE-STEP MULTIPLE TESTING PROCEDURES¹


BY SANAT K. SARKAR 
Temple University 


Results on the false discovery rate (FDR) and the false nondiscovery rate (FNR) are developed for single-step multiple testing procedures. In addition to verifying desirable properties of FDR and FNR as measures of error rates, these results extend previously known results, providing further insights, particularly under dependence, into the notions of FDR and FNR and related measures. First, considering fixed configurations of true and false null hypotheses, inequalities are obtained to explain how an FDR- or FNR-controlling single-step procedure, such as a Bonferroni or Šidák procedure, can potentially be improved. Two families of procedures are then constructed, one that modifies the FDR-controlling and the other that modifies the FNR-controlling Šidák procedure. These are proved to control FDR or FNR under independence less conservatively than the corresponding families that modify the FDR- or FNR-controlling Bonferroni procedure. Results of numerical investigations of the performance of the modified Šidák FDR procedure over its competitors are presented. Second, considering a mixture model where different configurations of true and false null hypotheses are assumed to have certain probabilities, results are also derived that extend some of Storey's work to the dependence case.


1. Introduction. The false discovery rate (FDR) and related measures have 
been receiving considerable attention due to their relevance as measures of the 
overall error rate in multiple testing problems that arise in many scientific investigations, particularly in the context of DNA microarray analysis. Consider Table 1, which summarizes the outcomes in multiple testing of n null hypotheses $H_1, \ldots, H_n$. Let $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$, that is, the proportion of false positives (Type I errors) among the rejected null hypotheses. Genovese
and Wasserman [9] called this the false discovery proportion (FDP). The FDR 
is defined by E(Q). It was first introduced in multiple testing by Benjamini and 
Hochberg [1], who provided a step-up procedure that controls the FDR with in- 
dependent test statistics. Later, Benjamini and Liu [4] offered a step-down FDR 
procedure under independence. The FDR-controlling property of the Benjamini— 
Hochberg (BH) procedure was extended by Benjamini and Yekutieli [5] to some 


Received November 2003; revised March 2005.
¹Supported by NSF Grant DMS-03-06366 and a 2003 Summer Research Fellowship awarded by Temple University.
AMS 2000 subject classifications. Primary 62J15, 62H15; secondary 62H99.
Key words and phrases. Modified Bonferroni and Šidák procedures, mixture model, positive false discovery rate, positive false nondiscovery rate.




TABLE 1
The outcomes in testing n null hypotheses

              Rejected   Accepted   Total
True null        V          U       $n_0$
False null       S          T       $n_1$
Total            R          A       n


positively dependent multivariate distributions. Sarkar [14] proved that the criti- 
cal values of the BH procedure can be used in a more general stepwise procedure 
to provide control of the FDR not only under independence, but also when the 
test statistics have the same type of positive dependence property as considered by 
Benjamini and Yekutieli [5]. In addition, he established the FDR-controlling prop- 
erty of the Benjamini—Liu step-down procedure for some positively dependent test 
statistics. Genovese and Wasserman [8, 9] investigated some operating characteris- 
tics of the BH procedure asymptotically under independence and further extended 
the theory of FDR by taking a stochastic process approach. 

A slightly different concept of FDR, called the positive false discovery rate 
(pFDR), was considered by Storey [17]. It is defined as the conditional FDR given 
at least one rejection, that is, pFDR = E(V/R|R > 0), and it has the interpretation 
of a Bayesian Type I error rate under a mixture model involving i.i.d. p-values 
when a single-step multiple testing procedure is used; see also [18]. Storey [17] 
provided estimates of FDR and pFDR under the above mixture model for a single- 
step procedure that are related to the empirical Bayes FDR of Efron, Tibshirani, 
Storey and Tusher [7]; see also [6]. A new family of FDR procedures based on estimates of FDR was suggested by Storey [17] and Storey, Taylor and Siegmund [19].

An analog of FDR in terms of false negatives (Type II errors) was introduced 
by Genovese and Wasserman [8] and Sarkar [15]. It is the FNR, called false 
nondiscovery rate by Genovese and Wasserman [8] and the false negatives rate 
by Sarkar [15]. It is defined by E(N), where $N = T/A$ if $A > 0$ and $N = 0$ if $A = 0$
is the proportion of false negatives among the accepted null hypotheses or the 
false nondiscovery proportion (FNP) [9]. Storey [18] defined the pFNR (positive 
false nondiscovery rate), the conditional expectation E(T/A|A > 0), as an ana- 
log of his pFDR. While Genovese and Wasserman [8] considered new methods 
that incorporate both FDR and FNR, Storey [18] established a connection between 
multiple testing and classification theory in terms of a combination of pFDR and 
pFNR. Sarkar [15] proved that the FNR can be controlled by a step-down analog of the BH procedure. He also introduced a concept of unbiasedness of an FDR- or FNR-controlling multiple testing procedure and established this property for a
generalized stepwise procedure under independence. 
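
The two proportions defined above can be computed directly from the counts in Table 1. The following is a minimal sketch; the function name `fdp_fnp` and the example counts are illustrative assumptions, not taken from the paper. The FDR and FNR are the expectations of the two returned quantities.

```python
def fdp_fnp(V, S, U, T):
    """Return (Q, N) for one realisation of the counts in Table 1.

    V: true nulls rejected,   S: false nulls rejected,
    U: true nulls accepted,   T: false nulls accepted.
    """
    R = V + S                        # total number of rejections
    A = U + T                        # total number of acceptances
    Q = V / R if R > 0 else 0.0      # false discovery proportion
    N = T / A if A > 0 else 0.0      # false nondiscovery proportion
    return Q, N


if __name__ == "__main__":
    # Hypothetical outcome of n = 100 tests.
    print(fdp_fnp(V=2, S=8, U=85, T=5))
```

Both proportions are set to zero when their denominator is zero, which is what distinguishes FDR and FNR from the conditional versions pFDR and pFNR.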

In this article we mainly concentrate on single-step multiple testing procedures, 
and we develop new results on FDR and FNR with dependent test statistics both 




under a model where the configuration of true and false null hypotheses is as- 
sumed fixed, yet unknown, and under the so-called mixture model where different 
configurations of true and false null hypotheses are assumed to have certain prob- 
abilities. The intent of these results is to verify some desirable properties of FDR 
and FNR and to extend some previously known results, thereby providing further 
insights into the notions of FDR and FNR and related measures, particularly under 
dependence. 

Suppose that $X = (X_1, \ldots, X_n)$ has a joint distribution indexed by the set of parameters $\theta = (\theta_1, \ldots, \theta_n)$. Let $H_i : \theta_i \le \theta_{i0}$ be tested against $K_i : \theta_i > \theta_{i0}$, for some given $\theta_{i0}$, $i = 1, \ldots, n$. Let $\{H_i : i \in J_0\}$ and $\{H_i : i \in J_1\}$ be the sets of true and false null hypotheses, respectively. It will be assumed that $J_0$ is nonempty. Consider a single-step procedure that rejects $H_i$ in favor of $K_i$ if $X_i > t$ for some fixed t. Two of our main results with fixed $J_0$ and $J_1$ (Theorems 1 and 3) are that if X is stochastically increasing in each $\theta_i$, which is typically the case in many multiple testing problems, then the maximum values of FDR and FNR of a single-step procedure are $(n_0/n) P\{R > 0\}$ and $(n_1/n) P\{A > 0\}$, respectively, where the probabilities are evaluated at $\theta_0 = (\theta_{10}, \ldots, \theta_{n0})$ and X is assumed exchangeable under these null hypothesis values. In addition to representing more precise versions of the results that state that Šidák and Bonferroni single-step procedures control FDR or FNR, these theorems show how these procedures can potentially be improved in terms of having better control of FDR or FNR by borrowing information about $n_0$ or $n_1$ from the data in the spirit of Benjamini and Hochberg [2], Benjamini, Krieger and Yekutieli [3], Storey [17] and Storey, Taylor and Siegmund [19]. Storey, Taylor and Siegmund [19] provided procedures for modifying the BH procedure using a family of estimates of $n_0$ and proved that they control FDR under independence. We obtain new families of procedures: one to modify the FDR-controlling and the other to modify the FNR-controlling Šidák procedure. Considering independent test statistics, we prove that they control FDR or FNR. The modified Šidák FDR procedures are less conservative under independence than the corresponding family that modifies the Bonferroni procedure obtained by using the estimates of $n_0$ considered in [19]. An analogous result is true for modified Šidák FNR procedures. Our method of modifying the Šidák FDR and the Šidák FNR procedures relies directly on two new results, Theorems 2 and 4, which extend inequalities given by Theorems 1 and 3, respectively, under independence from a single-step to a two-step procedure.

Next, we derive certain results that extend Storey's [17, 18] work to the dependent case. Storey obtained expressions for the FDR and FNR of a single-step procedure under a mixture model where, given any configuration of true and false null hypotheses, the $X_i$'s are assumed to be independent, providing useful Bayesian interpretations to his notions of pFDR and pFNR. More specifically, he proved: $\mathrm{pFDR} = P\{H_1 \text{ is true} \mid X_1 > t\}$ and $\mathrm{pFNR} = P\{H_1 \text{ is false} \mid X_1 \le t\}$, irrespective of the number of tests. Assuming a more general mixture model in which the $X_i$'s are assumed to be dependent with a location family of distributions and to have a certain type of positive dependence structure, we prove in Theorems 5 and 6, respectively, that $\mathrm{pFDR} \le \max_{1 \le i \le n} P\{H_i \text{ is true} \mid X_i > t\}$ and $\mathrm{pFNR} \le \max_{1 \le i \le n} P\{H_i \text{ is false} \mid X_i \le t\}$, with the equalities holding under independence. An important implication of the first inequality is that Storey's [17] q-value for a single-step multiple test under certain commonly encountered types of dependence is more conservative, as one would desire, than that under independence.
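
As a small numerical illustration of the Bayesian interpretation quoted above, the posterior probability that a null hypothesis is true given rejection can be evaluated by Bayes' rule in a two-component mixture. This is only a sketch: the prior probability `pi0`, the N(0, 1) null and the N(2, 1) alternative are assumptions chosen for the example and do not come from the paper.

```python
from statistics import NormalDist


def pfdr_single_test(pi0, t, null=NormalDist(0, 1), alt=NormalDist(2, 1)):
    """P{H is true | X > t} when H is true with prior probability pi0.

    Under independence this posterior probability is the pFDR of the
    single-step rule "reject when X > t", whatever the number of tests.
    """
    tail0 = 1 - null.cdf(t)          # P{X > t | H true}
    tail1 = 1 - alt.cdf(t)           # P{X > t | H false}
    return pi0 * tail0 / (pi0 * tail0 + (1 - pi0) * tail1)


if __name__ == "__main__":
    print(pfdr_single_test(pi0=0.9, t=2.0))
```

Raising the cutoff t shrinks the null tail faster than the alternative tail in this example, so the value decreases as t grows.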

The paper is organized as follows. In Section 2 we formally define the stochastic increasing property we need for X to obtain the maximum values of FDR and FNR for fixed $J_0$ and $J_1$. Section 3 reports the results related to FDR for fixed $J_0$ and $J_1$, and some numerical results that show the performance of the modified Šidák procedure in controlling FDR compared to the modified Bonferroni and the original Bonferroni and Šidák procedures. Similar results related to FNR are presented in Section 4, without any additional numerical evidence. Section 5 numerically compares the Bonferroni and Šidák procedures with their modified versions in terms of a concept of power involving both FDR and FNR. Section 6 presents the results on FDR and FNR under the aforementioned mixture model with dependent X. Proofs are given in Section 7. The paper concludes with some final remarks in Section 8.


2. Stochastically increasing family of distributions. This section defines a type of stochastic increasing property of a family of distributions that will be required to establish our results on FDR and FNR. Whenever an increasing or decreasing condition or property in terms of X or $\theta$ is mentioned, it is to be understood as being coordinatewise.


DEFINITION 1. An n-dimensional random vector $X = (X_1, \ldots, X_n)$ or the corresponding family of distributions $\{P_\theta\}$, where $\theta = (\theta_1, \ldots, \theta_n)$, is said to be stochastically increasing in $\theta$ if $P_\theta\{X \in C\}$ is increasing in $\theta$ for any set C that is increasing.


EXAMPLE 1 (Random variables with mixtures of independent stochastically increasing distributions). In multiple testing, the $X_i$'s often have distributions that are mixtures of independent stochastically increasing distributions. That is, the density of $P_\theta$ is often of the form

$$f_\theta(x) = \int \prod_{i=1}^{n} f_{i\theta_i}(x_i, y)\, dG(y),$$

where $f_{i\theta_i}(x, y)$ is stochastically increasing in $\theta_i$ for each y and G is a probability distribution independent of $\theta$. A stronger condition is often useful to check for the stochastic increasing property of $f_{i\theta_i}(x, y)$ in $\theta_i$: the monotone likelihood ratio (MLR) condition of Lehmann [12], satisfied by many of the commonly used distributions, which requires that for any $\theta_i < \theta_i'$, $f_{i\theta_i'}(x, y)/f_{i\theta_i}(x, y)$ is increasing in x for each i. The multivariate distribution of such random variables is stochastically increasing in $\theta$.


EXAMPLE 2 (Multivariate location family of distributions). Let the density of $P_\theta$ be of the form $f_\theta(x) = f(x - \theta)$. Distributions of this type are stochastically increasing. This is because, for any $\theta \le \theta'$ and any increasing set C, we have

$$P_{\theta'}\{X \in C\} = P_\theta\{X \in C - (\theta' - \theta)\} \ge P_\theta\{X \in C\}.$$


Many of the distributions that arise in multiple testing are of the type in Example 1 or 2. For instance, (i) independent normals with $\theta_i$'s representing the means, (ii) absolute values of independent normals with $\theta_i$'s representing the absolute means, (iii) independent chi-squares where $\theta_i$'s are the scale parameters or (iv) scaled mixtures of all these distributions, are of the type in Example 1. They arise in simultaneous testing of means or variances of independent normals against one- or two-sided alternatives. Multivariate F that arises in many-to-one comparisons of variances against one-sided alternatives is another distribution of the type in Example 1. Multivariate normal and multivariate t are distributions of the type in Example 2, arising, for instance, in Dunnett's many-to-one comparisons of means against one-sided alternatives in a one-way layout with a known or unknown common variance.


3. Results on FDR for fixed $J_0$ and $J_1$. In this section we derive results on the FDR of a single-step procedure, assuming fixed, but unknown, $J_0$ and $J_1$. We use the following notation here and in the rest of the paper. Define $J = \{1, \ldots, n\}$ and $J^{(-i)} = J - \{i\}$. Define $X_{(1)} \le \cdots \le X_{(n)}$ as the ordered components of the set $\{X_j : j \in J\}$ and $X^{(-i)}_{(1)} \le \cdots \le X^{(-i)}_{(n-1)}$ as those of the subset $\{X_j : j \in J^{(-i)}\}$. We assume that the marginal distribution of any $X_i$ depends on $\theta$ only through the corresponding $\theta_i$.

First, we have the following lemma. 


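LEMMA 1. The FDR of the single-step procedure with fixed critical value t is given by

(3.1)
$$\begin{aligned}
\mathrm{FDR}_\theta(t; J_0, J_1)
&= \sum_{i \in J_0} \Big[ P_\theta\{X_i > t\} - \sum_{j=1}^{n-1} \frac{P_\theta\{X^{(-i)}_{(j)} > t,\ X_i > t\}}{(n-j)(n-j+1)} \Big] \\
&= P_\theta\{X_{(n)} > t\} - \sum_{i \in J_1} \Big[ P_\theta\{X_i > t\} - \sum_{j=1}^{n-1} \frac{P_\theta\{X^{(-i)}_{(j)} > t,\ X_i > t\}}{(n-j)(n-j+1)} \Big].
\end{aligned}$$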


Now suppose that X is stochastically increasing in $\theta$. Then, since the set $\{X^{(-i)}_{(j)} > t,\ X_i > t\}$ is increasing in X, the probability $P_\theta\{X^{(-i)}_{(j)} > t,\ X_i > t\}$ is increasing in $\theta$. The probability $P_\theta\{X_{(n)} > t\}$ is also increasing in $\theta$ because $\{X_{(n)} > t\}$ is an increasing set. Thus, using the first expression of the FDR in (3.1), we notice that it is decreasing in $\theta$ and, hence, in $\{\theta_i : i \in J_1\}$ for fixed $\{\theta_i : i \in J_0\}$, whereas from the second expression we see that it is increasing in $\{\theta_i : i \in J_0\}$ for fixed $\{\theta_i : i \in J_1\}$. In other words, $\mathrm{FDR}_\theta(t; J_0, J_1)$ decreases as $\theta_i$ moves away from $\theta_{i0}$ for at least one $i \in J_0$ or at least one $i \in J_1$, with

(3.2) $$\sup_\theta \mathrm{FDR}_\theta(t; J_0, J_1) = \mathrm{FDR}_{\theta_0}(t; J_0, J_1),$$

where $\theta_0 = (\theta_{10}, \ldots, \theta_{n0})$. If X is exchangeable when $\theta = \theta_0$ with the common marginal c.d.f. $F_0$, the right-hand side of (3.2) reduces to

(3.3)
$$n_0 \Big[ \bar F_0(t) - \sum_{j=1}^{n-1} \frac{P_{\theta_0}\{X^{(-1)}_{(j)} > t,\ X_1 > t\}}{(n-j)(n-j+1)} \Big] = \frac{n_0}{n}\, \mathrm{FDR}_{\theta_0}(t; J, \phi) = \frac{n_0}{n}\, P_{\theta_0}\{R > 0\},$$

where $\bar F_0 = 1 - F_0$ and $\phi$ represents a null set. Thus, we have the following theorem, which is one of the main results of this article.


THEOREM 1. If X is stochastically increasing in $\theta$, then $\mathrm{FDR}_\theta(t; J_0, J_1)$ decreases as $\theta_i$ moves away from $\theta_{i0}$ for at least one $i \in J_0$ or for at least one $i \in J_1$. Furthermore, if X is exchangeable when $\theta = \theta_0$, then

(3.4) $$\sup_\theta \mathrm{FDR}_\theta(t; J_0, J_1) = \frac{n_0}{n}\, P_{\theta_0}\{R > 0\}.$$


Theorem 5.3 of [5] gives the above decreasing property of FDR with respect to only $\{\theta_i, i \in J_1\}$ under the assumptions that $\{X_i, i \in J_0\}$ and $\{X_i, i \in J_1\}$ are jointly independent and $\{X_i, i \in J_1\}$ is stochastically increasing in $\{\theta_i, i \in J_1\}$. Theorem 1 is a version of this for single-step procedures with dependent X and one-sided null hypotheses.
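
The identity in (3.4) can be checked exactly in the simplest exchangeable case, namely independent statistics with every $\theta_i$ at its null value, so that each $X_i$ exceeds t with the same probability p. The sketch below is an illustration under that assumption only; the function names and the values of `n0`, `n1` and `p` are hypothetical.

```python
from math import comb


def binom_pmf(k, m, p):
    """Probability that a Binomial(m, p) variable equals k."""
    return comb(m, k) * p**k * (1 - p) ** (m - k)


def fdr_at_null(n0, n1, p):
    """E[V/R; R > 0] with V ~ Bin(n0, p) and S ~ Bin(n1, p) independent.

    V counts rejections among the n0 true nulls and S among the n1 other
    hypotheses, all evaluated at theta_0, where p = 1 - F_0(t).
    """
    total = 0.0
    for v in range(1, n0 + 1):           # v = 0 contributes nothing to V/R
        for s in range(n1 + 1):
            total += v / (v + s) * binom_pmf(v, n0, p) * binom_pmf(s, n1, p)
    return total


if __name__ == "__main__":
    n0, n1, p = 7, 3, 0.05
    n = n0 + n1
    lhs = fdr_at_null(n0, n1, p)
    rhs = n0 / n * (1 - (1 - p) ** n)     # (n0/n) P{R > 0} under independence
    print(lhs, rhs)
```

The two numbers agree up to rounding because, by exchangeability, the expected share of true nulls among any fixed number of rejections is $n_0/n$.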

As a corollary to Theorem 1, if the critical value t provides a level $\alpha$ test for the overall null hypothesis $\bigcap_{i=1}^{n} H_i$, that is, if t satisfies $P_{\theta_0}\{R > 0\} = P_{\theta_0}\{\max_{i \in J} X_i > t\} \le \alpha$, then we have

(3.5) $$\mathrm{FDR}_\theta(t; J_0, J_1) \le \frac{n_0}{n}\, \alpha,$$

implying that the FDR is controlled at $\alpha$. Inequality (3.5) is interesting in that it represents a single-step analog of the same inequality known to hold for stepwise procedures with Simes [16] critical values providing an $\alpha$-level test for $\bigcap_{i=1}^{n} H_i$ [1, 5, 14]. Regarding the choice for t, if one does not want to utilize the distributional form of X or if it is unknown, the Bonferroni critical value that satisfies

(3.6) $$\bar F_0(t) = \frac{\alpha}{n}$$

can be used. If, however, X is known to be positively dependent so that the inequality $P_{\theta_0}\{\max_{i \in J} X_i \le t\} \ge F_0^n(t)$ holds under the null hypothesis values with the equality holding under independence, as in the case of many distributions that arise in multiple testing, the Šidák critical value t that satisfies the equation

(3.7) $$F_0(t) = (1 - \alpha)^{1/n}$$

offers a less conservative choice.
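
For a concrete sense of the two choices in (3.6) and (3.7), the sketch below solves both equations when the common null marginal $F_0$ is standard normal. The normal null and the function names are assumptions made for illustration.

```python
from statistics import NormalDist


def bonferroni_critical_value(n, alpha, null=NormalDist()):
    """Solve (3.6): 1 - F_0(t) = alpha / n."""
    return null.inv_cdf(1 - alpha / n)


def sidak_critical_value(n, alpha, null=NormalDist()):
    """Solve (3.7): F_0(t) = (1 - alpha)^(1/n)."""
    return null.inv_cdf((1 - alpha) ** (1 / n))


if __name__ == "__main__":
    n, alpha = 100, 0.05
    print("Bonferroni t:", bonferroni_critical_value(n, alpha))
    print("Sidak t:     ", sidak_critical_value(n, alpha))
```

Because $(1 - \alpha)^{1/n} \le 1 - \alpha/n$, the Šidák cutoff never exceeds the Bonferroni cutoff, which is the sense in which it is less conservative.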

We should point out that there is no surprise that the Bonferroni and Šidák single-step procedures control FDR, because they are known to control the familywise error rate (FWER). It is also known that, given $n_0$, it can be incorporated in the Bonferroni and other procedures to improve their FWER control [10]. What is new here is that Bonferroni and Šidák procedures can be further improved in terms of having better control of FDR using an estimate of $n_0$, in the spirit of Benjamini and Hochberg [2], Benjamini, Krieger and Yekutieli [3], Storey [17] and Storey, Taylor and Siegmund [19]. For instance, since $\sup_\theta \mathrm{FDR}_\theta(t; J_0, J_1) \le n_0\{1 - F_0(t)\}$, as we see from Theorem 1, rather than controlling $n\{1 - F_0(t)\}$, which the Bonferroni method does, a better control of FDR can be achieved if we control $\hat n_0\{1 - F_0(t)\}$ for some appropriately chosen estimate $\hat n_0$ of $n_0$. To estimate $n_0$, Storey [17] suggested using the ratio $K_\tau / F_0(\tau)$, where $K_\tau = \sum_{i=1}^{n} I(X_i \le \tau)$, for some well-chosen $\tau$. However, Storey, Taylor and Siegmund [19] slightly modified it and used

(3.8) $$\hat n_0(\tau) = \frac{K_\tau + 1}{F_0(\tau)}$$

to obtain a new class of BH-type FDR-controlling procedures under independence. We use this $\hat n_0$ in our modification to the Bonferroni procedure. Also, the $X_i$'s that are small compared to $\tau$ should not be declared large when modified Bonferroni is used. Thus, our modified Bonferroni procedure rejects $H_i$ whenever

(3.9) $$\bar F_0(X_i) \le \min\Big\{ \bar F_0(\tau),\ \frac{\alpha F_0(\tau)}{K_\tau + 1} \Big\}.$$

We prove later in this section that our modified Bonferroni procedure controls FDR under independence and we provide numerical evidence showing that quite often this control can be achieved much less conservatively. However, when X is known to be independent or at least positively dependent, a modification to the Šidák procedure is expected to produce a better performing procedure than the modified Bonferroni procedure. So, we first modify the Šidák procedure. The following theorem suggests how the idea of modifying the Bonferroni procedure can


be extended to that for the Šidák procedure. It extends the inequality for the FDR under independence, given by Theorem 1, from a single-step to a two-step procedure that, for some fixed $\tau \in (-\infty, \infty)$ and a predetermined function $t_\tau(k) \ge \tau$, $k = 0, 1, \ldots, n$, first finds $k = \max_{0 \le i \le n}\{i : X_{(i)} \le \tau\}$ (note that $X_{(0)} = -\infty$), then rejects all $H_i$ for which $X_i > t_\tau(k)$.
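
The two-step recipe just described can be sketched for the modified Bonferroni rule: count the statistics at or below $\tau$, form the estimate (3.8), then reject when the null tail probability of $X_i$ is at most $\min\{\bar F_0(\tau), \alpha/\hat n_0(\tau)\}$. The standard normal null, the function name and the data are assumptions for illustration, not the paper's own code.

```python
from statistics import NormalDist


def modified_bonferroni(x, alpha, tau, null=NormalDist()):
    """Indices rejected by a two-step modified Bonferroni rule (sketch).

    Step 1: K_tau = number of statistics <= tau, n0_hat = (K_tau + 1) / F_0(tau).
    Step 2: reject H_i when 1 - F_0(x_i) <= min(1 - F_0(tau), alpha / n0_hat).
    """
    F0_tau = null.cdf(tau)
    k = sum(1 for xi in x if xi <= tau)          # K_tau
    n0_hat = (k + 1) / F0_tau                    # estimate of n_0
    level = min(1 - F0_tau, alpha / n0_hat)      # cutoff on the tail probability
    return [i for i, xi in enumerate(x) if 1 - null.cdf(xi) <= level]


if __name__ == "__main__":
    x = [-1.0, 3.0, 2.45, 3.5, 4.0, 2.8, -0.3, 2.7]   # hypothetical statistics
    print(modified_bonferroni(x, alpha=0.05, tau=0.0))
```

In this example only two statistics fall below $\tau = 0$, so the estimate of $n_0$ is 6 rather than $n = 8$, and the value 2.45 is rejected although its tail probability exceeds the plain Bonferroni level $\alpha/n$.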


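THEOREM 2. Let X be independent with the distribution of $X_i$, indexed by the parameter $\theta_i$, belonging to an MLR family and having identical marginals when $\theta = \theta_0$. Then, for a two-step procedure with $t_\tau(k) \ge \tau$, for all $k = 0, 1, \ldots, n$, the FDR satisfies the inequality

(3.10)
$$\mathrm{FDR}^{(2)}_\theta(t_\tau \ge \tau; J_0, J_1) \le \bar F_0(\tau) \sum_{i \in J_0} \sum_{k=0}^{n-1} \frac{1}{n-k} \Big[ 1 - \Big( 1 - \frac{\bar F_0(t_\tau(k))}{\bar F_0(\tau)} \Big)^{n-k} \Big] P_\theta\{X^{(-i)}_{(k)} \le \tau < X^{(-i)}_{(k+1)}\}$$

(with $X^{(-i)}_{(0)} = -\infty$ and $X^{(-i)}_{(n)} = \infty$).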

When $\tau = -\infty$, $k = 0$ with probability 1 and (3.10) reduces to the one given by Theorem 1 under independence with $t = t_{-\infty}(0)$. It is interesting to see that $\mathrm{FDR}^{(2)}_\theta(t_\tau \ge \tau; J_0, J_1) \le \mathrm{FDR}_\theta(\tau; J_0, J_1)$.

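The modified Bonferroni procedure is a two-step procedure with $t_\tau(k)$ given by the right-hand side of (3.9) given $K_\tau = k$; that is, $t_\tau(k)$ is such that $\bar F_0(t_\tau(k)) = \min\{\bar F_0(\tau),\ \alpha F_0(\tau)/(k + 1)\}$. We propose to modify the Šidák procedure using a two-step procedure where $t_\tau(k)$ is such that

(3.11) $$\bar F_0(t_\tau(k)) = \bar F_0(\tau) \Big\{ 1 - \Big( 1 - \min\Big[ 1,\ \frac{\alpha\,(n-k)\,F_0(\tau)}{(k+1)\,\bar F_0(\tau)} \Big] \Big)^{1/(n-k)} \Big\}$$

with $t_\tau(n) = \infty$. The right-hand side of (3.10) for this modified Šidák procedure is less than or equal to

(3.12)
$$\begin{aligned}
\alpha \sum_{i \in J_0} \sum_{k=0}^{n-1} \frac{F_0(\tau)}{k+1}\, P_\theta\{X^{(-i)}_{(k)} \le \tau < X^{(-i)}_{(k+1)}\}
&\le \alpha \sum_{i \in J} \sum_{k=1}^{n} \frac{1}{k}\, P_\theta\{X_i \le \tau,\ X^{(-i)}_{(k-1)} \le \tau < X^{(-i)}_{(k)}\} \\
&= \alpha \sum_{k=1}^{n} P_\theta\{X_{(k)} \le \tau < X_{(k+1)}\} \\
&= \alpha\, P_\theta\{X_{(1)} \le \tau\};
\end{aligned}$$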


see, for example, [13], page 497, for the first equality in (3.12). Thus, we see that our modified Šidák procedure controls FDR under independence.

The right-hand side of (3.10) is less than or equal to

(3.13) $$\sum_{i \in J_0} \sum_{k=0}^{n-1} \bar F_0(t_\tau(k))\, P_\theta\{X^{(-i)}_{(k)} \le \tau < X^{(-i)}_{(k+1)}\},$$

which, for the modified Bonferroni procedure, is less than or equal to the first expression in (3.12). Thus, the FDR of the modified Bonferroni procedure is also less than or equal to $\alpha P_\theta\{X_{(1)} \le \tau\}$ and, hence, is controlled; of course, it is controlled more conservatively than the modified Šidák procedure.
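
To compare the two modifications numerically, one can tabulate the tail level $\bar F_0(t_\tau(k))$ that each assigns for a given first-step count k. The sketch below implements the modified Bonferroni level and the modified Šidák level (3.11) as read from the text; the function names and the choices n = 100, $\alpha = 0.05$, $F_0(\tau) = 1/2$ are illustrative assumptions.

```python
def modified_bonferroni_level(k, alpha, F0_tau):
    """Tail level min{1 - F_0(tau), alpha * F_0(tau) / (k + 1)}."""
    return min(1 - F0_tau, alpha * F0_tau / (k + 1))


def modified_sidak_level(k, n, alpha, F0_tau):
    """Tail level 1 - F_0(t_tau(k)) from (3.11), for k = 0, ..., n - 1."""
    Fbar_tau = 1 - F0_tau
    m = n - k                                    # statistics above tau
    c = min(1.0, alpha * m * F0_tau / ((k + 1) * Fbar_tau))
    return Fbar_tau * (1 - (1 - c) ** (1 / m))


if __name__ == "__main__":
    n, alpha, F0_tau = 100, 0.05, 0.5
    for k in (10, 50, 90):
        print(k, modified_bonferroni_level(k, alpha, F0_tau),
              modified_sidak_level(k, n, alpha, F0_tau))
```

Since $1 - (1 - c)^{1/m} \ge c/m$ for $0 \le c \le 1$, the Šidák level is never smaller than the Bonferroni level, which matches the statement that the modified Bonferroni procedure controls FDR more conservatively.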

We conducted a numerical study to investigate the extent of improvement offered by our modified Šidák procedure in controlling FDR over the modified Bonferroni and the original Bonferroni and Šidák procedures. We generated n = 100 dependent random variables $X_i \sim N(\mu_i, 1)$, $i = 1, \ldots, 100$, with the same variance 1 and a common correlation $\rho$, and performed 100 hypothesis tests of $\mu_i = 0$ against $\mu_i > 0$, each using first the Bonferroni critical value and then the Šidák critical value corresponding to $\alpha = 0.05$. The value of Q was then calculated for each procedure by setting $n_0$ of the $\mu_i$'s to zero and the remaining $\mu_i$'s to a positive value $\delta$. The FDR then was estimated by averaging the Q values over 5000 iterations. Thus, we have the simulated FDR of the Bonferroni and Šidák procedures. We chose $F_0(\tau) = 1/2$ and similarly calculated the FDR of the modified Bonferroni and Šidák procedures corresponding to this $\tau$. Table 2 compares the FDRs of the Bonferroni and Šidák procedures and their modifications for $n_0$ = 30, 50, 70 and 90, $\rho$ = 0 (independent) and 0.5 (dependent), and for different values of $\delta$. The last row of this table gives the maximum of the standard errors of the estimated (simulated) FDRs in each column.
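
A simulation of this kind can be sketched as follows for the original Bonferroni procedure. The construction of equicorrelated normals through one shared factor, the seed, and the function name are assumptions of this sketch; the paper does not state how its variables were generated, so the output is not expected to reproduce Table 2 digit for digit.

```python
import random
from statistics import NormalDist


def simulate_bonferroni_fdr(n=100, n0=50, delta=1.5, rho=0.0, alpha=0.05,
                            iterations=5000, seed=0):
    """Monte Carlo estimate of the FDR of the single-step Bonferroni test.

    X_i = sqrt(rho) * Z_0 + sqrt(1 - rho) * Z_i + mu_i gives unit variances
    and common correlation rho; mu_i = 0 for the first n0 indices and
    mu_i = delta for the rest.
    """
    rng = random.Random(seed)
    t = NormalDist().inv_cdf(1 - alpha / n)      # Bonferroni critical value (3.6)
    a, b = rho ** 0.5, (1 - rho) ** 0.5
    total_q = 0.0
    for _ in range(iterations):
        z0 = rng.gauss(0, 1)                     # factor shared by all X_i
        v = r = 0
        for i in range(n):
            mu = 0.0 if i < n0 else delta
            if a * z0 + b * rng.gauss(0, 1) + mu > t:
                r += 1                           # a rejection
                v += i < n0                      # ... of a true null
        total_q += v / r if r else 0.0           # Q = V/R, with Q = 0 if R = 0
    return total_q / iterations


if __name__ == "__main__":
    print(simulate_bonferroni_fdr(iterations=1000))
```

Replacing the critical value by the Šidák solution of (3.7), or by a data-dependent two-step cutoff, gives the other columns of such a comparison.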

As we expected, the modified Šidák procedure provided the least conservative control of FDR under independence. Since the Bonferroni and Šidák procedures are relatively more conservative when the actual proportion of true null hypotheses is small, the idea of improving them using an estimate of $n_0$ should work well in this situation. This idea is confirmed by our numerical study. Both modified Bonferroni and modified Šidák procedures are seen to control FDR much less conservatively than their unmodified versions under independence. In the dependent case, however, the idea of improving the Bonferroni and Šidák procedures may not work unless $n_0$ is small and the dependence is weak.

Having found more than one procedure that can control the FDR under independence (e.g., the Bonferroni, Šidák and their modifications), comparing them further in terms of power seems to be the next important objective. While the idea of power can be conceptualized in terms of Type II errors (false negatives) in several different ways, extending it from single testing to multiple testing, one particular concept, which is the average power [i.e., $(1/n_1) E(S)$], has been used in a


TABLE 2
Simulated values of the FDR of the Bonferroni and Šidák procedures and their modifications with α = 0.05

                       Independent (ρ = 0)                      Dependent (ρ = 0.5)
                  Bonferroni           Šidák               Bonferroni           Šidák
n_0     δ      Original Modified  Original Modified    Original Modified  Original Modified
30      0.5    0.0118   0.0150    0.0119   0.0167      0.0048   0.0167    0.0049   0.0332
        1.5    0.0045   0.0073    0.0045   0.0079      0.0006   0.0066    0.0006   0.0412
        2.5    0.0008   0.0019    0.0008   0.0022      0.0002   0.0054    0.0002   0.0412
50      0.5    0.0218   0.0259    0.0222   0.0276      0.0092   0.0307    0.0093   0.0493
        1.5    0.0103   0.0147    0.0106   0.0149      0.0015   0.0116    0.0015   0.0455
        2.5    0.0021   0.0031    0.0021   0.0033      0.0006   0.0093    0.0006   0.0441
70      0.5    0.0315   0.0349    0.0319   0.0359      0.0141   0.0488    0.0144   0.0667
        1.5    0.0187   0.0237    0.0189   0.0232      0.0034   0.0196    0.0034   0.0494
        2.5    0.0052   0.0061    0.00532  0.0061      0.0013   0.0154    0.0014   0.0463
90      0.5    0.0414   0.0423    0.0423   0.0434      0.0234   0.0734    0.0240   0.0903
        1.5    0.0351   0.0382    0.0359   0.0393      0.0108   0.0414    0.0111   0.0642
        2.5    0.0173   0.0180    0.0175   0.0189      0.0045   0.0311    0.0046   0.0554
MaxSE          0.0028   0.0028    0.0028   0.0029      0.0020   0.0034    0.0021   0.0038


number of recent papers to compare FDR-controlling procedures [4, 17, 19]. However, it is argued in [15] that since the FDR is a measure of false positives, it seems more appropriate to compare different FDR-controlling procedures using a similar measure in terms of false negatives, the FNR [8, 15]. It will be interesting to see how the different FDR-controlling procedures in this paper compare in terms of measures involving FNR under the same distributional setting. This will be carried out in Section 5 after deriving some results on FNR in the next section.


4. Results on FNR for fixed $J_0$ and $J_1$. We will derive in this section some results on FNR of a single-step procedure, analogous to those on FDR, again assuming a fixed configuration of true and false null hypotheses. First, we have the following lemma.


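LEMMA 2. An explicit expression of FNR is

(4.1)
$$\begin{aligned}
\mathrm{FNR}_\theta(t; J_0, J_1)
&= \sum_{i \in J_1} \Big[ P_\theta\{X_i \le t\} - \sum_{j=1}^{n-1} \frac{P_\theta\{X^{(-i)}_{(j)} \le t,\ X_i \le t\}}{j(j+1)} \Big] \\
&= P_\theta\{X_{(1)} \le t\} - \sum_{i \in J_0} \Big[ P_\theta\{X_i \le t\} - \sum_{j=1}^{n-1} \frac{P_\theta\{X^{(-i)}_{(j)} \le t,\ X_i \le t\}}{j(j+1)} \Big].
\end{aligned}$$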


Making the same kind of arguments as we made before for the monotonicity property of the FDR, we notice that if X is stochastically increasing in $\theta$, the FNR is increasing in $\{\theta_i : i \in J_0\}$ for fixed $\{\theta_i : i \in J_1\}$ and is decreasing in $\{\theta_i : i \in J_1\}$ for fixed $\{\theta_i : i \in J_0\}$. In other words, $\mathrm{FNR}_\theta(t; J_0, J_1)$ decreases as $\theta_i$ moves away from $\theta_{i0}$ for at least one $i \in J_0$ or at least one $i \in J_1$, with

(4.2) $$\sup_\theta \mathrm{FNR}_\theta(t; J_0, J_1) = \mathrm{FNR}_{\theta_0}(t; J_0, J_1).$$


Since, when $\theta = \theta_0$, X is exchangeable, the right-hand side in (4.2) reduces to

(4.3)
$$n_1 \Big[ F_0(t) - \sum_{j=1}^{n-1} \frac{P_{\theta_0}\{X^{(-1)}_{(j)} \le t,\ X_1 \le t\}}{j(j+1)} \Big] = \frac{n_1}{n}\, P_{\theta_0}\{A > 0\}.$$

The equality in (4.3) follows from (4.1); see also [13]. This gives the next main result of this article.


THEOREM 3. If X is stochastically increasing in $\theta$, then $\mathrm{FNR}_\theta(t; J_0, J_1)$ decreases as $\theta_i$ moves away from $\theta_{i0}$ for at least one $i \in J_0$ or for at least one $i \in J_1$. Furthermore, if X is exchangeable when $\theta = \theta_0$, then

(4.4) $$\sup_\theta \mathrm{FNR}_\theta(t; J_0, J_1) = \frac{n_1}{n}\, P_{\theta_0}\{A > 0\}.$$


Clearly, the FNR of a single-step procedure can be controlled at a level $\beta$ under the condition stated in the above theorem by choosing a fixed t subject to the condition $P_{\theta_0}\{A > 0\} = P_{\theta_0}\{\min_{i \in J} X_i \le t\} \le \beta$. If the dependence structure of X is not utilized, the equation $F_0(t) = \beta/n$ provides a Bonferroni-type choice for t. When X is known to be positively dependent so that the inequality $P_{\theta_0}\{\min_{i \in J} X_i > t\} \ge \bar F_0^n(t)$ is true, with the equality holding under independence, a Šidák-type t can be determined from the equation $F_0(t) = 1 - (1 - \beta)^{1/n}$. These procedures can potentially be improved in terms of having better control of FNR by borrowing information from the $X_i$'s exceeding an appropriately chosen value $\tau$.

The following theorem is an FNR analog of Theorem 2 that extends the inequality on FNR given by Theorem 3 from a single-step to a two-step procedure and suggests how to modify the above single-step FNR-controlling procedures.


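THEOREM 4. Under the conditions stated in Theorem 2, the FNR of a two-step
procedure with $t_\tau(k) \le \tau$ for all $k = 0, 1, \ldots, n$ satisfies the inequality


(4.5)  $\mathrm{FNR}^{(2)}_\theta(t_\tau \le \tau; J_0, J_1) \le \sum_{i \in J_1} \sum_{k=1}^{n} \frac{F_0(\tau)}{k} \Bigl[ 1 - \Bigl( 1 - \frac{F_0(t_\tau(k))}{F_0(\tau)} \Bigr)^{k} \Bigr] P_\theta\{X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\}.$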


FDR AND FNR IN SINGLE-STEP TESTS 405


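When $\tau \to \infty$, $k = n$ with probability 1 and the above inequality reduces to that
given by Theorem 3 under independence with $t = t_\infty(n)$. We modify the Sidák
procedure using a two-step procedure with $t_\tau(k) \le \tau$ satisfying


(4.6)  $F_0(t_\tau(k)) = F_0(\tau) \Bigl\{ 1 - \Bigl( 1 - \min\Bigl[ 1, \frac{\beta k\,[1 - F_0(\tau)]}{(n - k + 1) F_0(\tau)} \Bigr] \Bigr)^{1/k} \Bigr\}$


and $t_\tau(0) = -\infty$. For this modified Sidák procedure,

(4.7)  $\mathrm{FNR}^{(2)}_\theta(t_\tau \le \tau; J_0, J_1)$
       $\le \sum_{i \in J_1} \sum_{k=1}^{n} \frac{\beta\,[1 - F_0(\tau)]}{n - k + 1} P_\theta\{X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\}$
       $\le \sum_{i \in J_1} \sum_{k=1}^{n} \frac{\beta}{n - k + 1} P_\theta\{X_i \ge \tau,\ X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\}$
       $\le \sum_{i=1}^{n} \sum_{k=1}^{n} \frac{\beta}{n - k + 1} P_\theta\{X_i \ge \tau,\ X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\}$
       $= \beta \sum_{k=0}^{n-1} P_\theta\{X_{(k)} < \tau \le X_{(k+1)}\}$
       $= \beta P_\theta\{X_{(n)} \ge \tau\}.$


The second inequality in (4.7) follows from the fact that $1 - F_0(\tau) \le 1 - F_{\theta_i}(\tau)$ for $i \in J_1$; for the
first equality, see [13]. Thus, the above modified Sidák procedure controls FNR
under independence.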

The right-hand side of (4.5) is less than or equal to

(4.8)  $\sum_{i \in J_1} \sum_{k=1}^{n} F_0(t_\tau(k))\, P_\theta\{X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\}.$

This is less than or equal to the right-hand side of the first inequality in (4.7), which
is less than or equal to $\beta$, if we choose $t_\tau(k) \le \tau$ satisfying

(4.9)  $F_0(t_\tau(k)) = \min\Bigl[ F_0(\tau), \frac{\beta\,[1 - F_0(\tau)]}{n - k + 1} \Bigr].$

This gives us our FNR-controlling modified Bonferroni procedure, which is of
course more conservative than the modified Sidák procedure in the sense that it
allows fewer nondiscoveries.
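
The comparison between the two modified procedures can be checked numerically. The sketch below assumes the critical levels take the forms $F_0(t_\tau(k)) = F_0(\tau)\{1 - (1 - \min[1, \beta k (1 - F_0(\tau))/((n - k + 1) F_0(\tau))])^{1/k}\}$ for the modified Sidák procedure and $F_0(t_\tau(k)) = \min[F_0(\tau), \beta (1 - F_0(\tau))/(n - k + 1)]$ for the modified Bonferroni procedure; this is one reading of (4.6) and (4.9), and the function and argument names are hypothetical.

```python
def modified_sidak_level(k: int, n: int, beta: float, f0_tau: float) -> float:
    """Assumed form of F_0(t_tau(k)) for the modified Sidak procedure, k = 1..n."""
    x = min(1.0, beta * k * (1.0 - f0_tau) / ((n - k + 1) * f0_tau))
    return f0_tau * (1.0 - (1.0 - x) ** (1.0 / k))


def modified_bonferroni_level(k: int, n: int, beta: float, f0_tau: float) -> float:
    """Assumed form of F_0(t_tau(k)) for the modified Bonferroni procedure."""
    return min(f0_tau, beta * (1.0 - f0_tau) / (n - k + 1))


n, beta, f0_tau = 20, 0.05, 0.5  # f0_tau plays the role of F_0(tau)
levels = [
    (k,
     modified_bonferroni_level(k, n, beta, f0_tau),
     modified_sidak_level(k, n, beta, f0_tau))
    for k in range(1, n + 1)
]
for k, bonf, sidak in levels:
    print(k, round(bonf, 5), round(sidak, 5))
```

For every k the Bonferroni-type level is no larger than the Sidák-type level, which is the sense in which the former is more conservative; both stay at or below $F_0(\tau)$, so that $t_\tau(k) \le \tau$.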


REMARK 1. It is important to note that the above results on FNR have been
developed with the idea of controlling false nondiscoveries of any set of true
alternatives (or false nulls). However, one is often interested in controlling false


406 S. K. SARKAR 


nondiscoveries of a prespecified set of true alternatives. These results can be easily
modified in such a situation. Let $\theta_i = \theta_{i1}$ for some specified $\theta_{i1} > \theta_{i0}$, $i \in J_1$.
Assume that X is exchangeable under $\theta = \theta_1 = (\theta_{11}, \ldots, \theta_{n1})$. Then Theorem 3 can
be modified to

(4.10)  $\sup_{\theta} \mathrm{FNR}_\theta(t; J_0, J_1) = \frac{n_1}{n} P_{\theta_1}\{A > 0\}$

and Theorem 4 can be modified to

(4.11)  $\mathrm{FNR}^{(2)}_\theta(t_\tau \le \tau; J_0, J_1) \le \sum_{i \in J_1} \sum_{k=1}^{n} \frac{F_1(\tau)}{k} \Bigl[ 1 - \Bigl( 1 - \frac{F_1(t_\tau(k))}{F_1(\tau)} \Bigr)^{k} \Bigr] P_\theta\{X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\},$

where $F_1$ is the common c.d.f. of $X_i$ under $\theta_{i1}$. The Bonferroni and Sidák procedures
as well as their two-step modifications using critical values based on $F_1$ will
provide better control of FNR in this case than values based on $F_0$.


We conducted a numerical study to investigate how well these different FNR 
procedures control FNR under a specified set of true alternatives. We noticed, as 
in the case of controlling FDR, that although both modified Bonferroni and Sidák 
procedures often control FNR much less conservatively than their unmodified ver- 
sions, the modified Sidák procedure provides the best control of FNR. 


5. A numerical study. In this section we compare the different 
FDR-controlling procedures under independence discussed in Section 3 in terms 
of a concept of power that relates to the unbiasedness condition Sarkar [15] in- 
troduced. Since the FDR measures the expected proportion of incorrect decisions, 
a good multiple testing procedure must ensure that it does not exceed the ex- 
pected proportion of correct decisions. The quantity 1 — FNR, which Genovese 
and Wasserman [8] called the correct nondiscovery rate, is a measure of correct 
decisions. In situations where controlling false negatives is of primary importance, 
the FNR provides a measure of incorrect decisions with the corresponding measure 
of correct decisions being 1 — FDR. Whether we have a multiple testing procedure 
designed to control FDR or FNR, the inequality FDR + FNR < 1 represents a 
desirable property for any such multiple testing procedure. This is referred to as 
the unbiasedness condition of an FDR- or FNR-controlling multiple testing pro- 
cedure. A natural way to compare different FDR- or FNR-controlling procedures 
would be to see how they perform in terms of a measure that reflects the strength 
of unbiasedness. This leads us to the consideration of the quantity 


(5.1)  $\pi_\theta = 1 - \mathrm{FDR}_\theta - \mathrm{FNR}_\theta.$


It is also related to the idea of Genovese and Wasserman [8], who suggested using
$1 - \pi_\theta$ as a risk function to compare multiple testing procedures. This is our
concept of power.
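
To make the quantity in (5.1) concrete, here is a small Monte Carlo sketch that estimates FDR, FNR and the resulting power for a single-step procedure rejecting $H_i$ when $X_i \ge t$. Everything specific in it is an assumption made for illustration: independent unit-variance normal statistics, a common shift `delta` under the alternatives, and a Sidák-type cut-off solving $P\{\max_i X_i \ge t\} = \alpha$ under the global null. It does not reproduce the simulation settings of the paper.

```python
import random
from statistics import NormalDist


def simulate_power(n=20, n0=10, delta=2.0, alpha=0.05, reps=2000, seed=0):
    """Estimate (FDR, FNR, 1 - FDR - FNR) for a single-step test X_i >= t."""
    rng = random.Random(seed)
    # Assumed Sidak-type cut-off: F_0(t)^n = 1 - alpha under independence.
    t = NormalDist().inv_cdf((1.0 - alpha) ** (1.0 / n))
    fdp_sum = 0.0
    fnp_sum = 0.0
    for _ in range(reps):
        # The first n0 statistics are true nulls, the rest are shifted by delta.
        x = [rng.gauss(0.0, 1.0) for _ in range(n0)]
        x += [rng.gauss(delta, 1.0) for _ in range(n - n0)]
        rejected = [xi >= t for xi in x]
        r = sum(rejected)                       # number of rejections R
        a = n - r                               # number of acceptances A
        v = sum(rejected[:n0])                  # false discoveries
        m = sum(not flag for flag in rejected[n0:])  # false nondiscoveries
        fdp_sum += v / r if r > 0 else 0.0
        fnp_sum += m / a if a > 0 else 0.0
    fdr = fdp_sum / reps
    fnr = fnp_sum / reps
    return fdr, fnr, 1.0 - fdr - fnr


fdr, fnr, power = simulate_power()
print(fdr, fnr, power)
```

A positive value of the third component corresponds to the unbiasedness condition FDR + FNR < 1 discussed above.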

We investigated how the different FDR procedures in Section 3 perform in terms
of the aforementioned concept of power. We computed the FNR and then the power
1 - FNR - FDR for the Bonferroni and Sidák procedures and their modified versions
[with $F_0(\tau) = 1/2$] based on the normal data that have been simulated before
for FDR calculations. These simulated powers are displayed in Figure 1. As we see
from this figure, the modified Sidák procedure is often the most powerful under
independence, especially, as one would expect, when the proportion of true null
hypotheses is relatively small. The unmodified Bonferroni and Sidák procedures, not
surprisingly, are practically indistinguishable in terms of their power performance.
One should, however, be cautious in interpreting this graph in the dependent case
(particularly, the upper right two panels), in light of Table 1, which indicates that
the modified Bonferroni and Sidák procedures may fail to control FDR unless the
dependence is weak and $n_0$ is small.

We should point out that the unbiasedness property of the single-step procedures,
which is numerically seen to hold, can be theoretically proved easily from
Theorems 1 and 3. However, a theoretical justification of the same property for the
two-step procedures, which appears to be also true from Figure 1, is an interesting
and more challenging theoretical problem. Also, the same concept of power
could be used to compare different FNR-controlling procedures.

[Figure 1 omitted: panels plotting 1 - FDR - FNR (vertical axis) against delta
(horizontal axis) for the Bonferroni, modified Bonferroni, modified Sidák and
Sidák procedures.]

FIG. 1. Comparison of Bonferroni and Sidák procedures with their modified versions in terms of
1 - FDR - FNR.


6. Results on FDR and FNR under a mixture model. In this section, we
present appropriate modifications to Lemmas 1 and 2 when a mixture approach
is taken as in [7, 17]. We will, however, assume a slightly more general mixture
model in the sense that it does not assume independence of the test statistics. More
specifically, we first let $H = (H_1, \ldots, H_n)$, with $H_i = 0$ indicating that $H_i$ is true
and $H_i = 1$ indicating that it is false. Then we assume that $(X_i, H_i)$, $i = 1, \ldots, n$,
have the distribution


(6.1)  $X \mid H \sim f(x, \theta_H)$, where $\theta_H = (\theta_{H_1}, \ldots, \theta_{H_n})$, $\theta_{H_i} = (1 - H_i)\theta_i' + H_i \theta_i''$,
       with $\theta_i' \le \theta_{i0}$, $\theta_i'' > \theta_{i0}$, $i = 1, \ldots, n$,
       and $H \sim \pi_h$, where $\pi_h$ are some probabilities defined on
       $\mathcal{H} = \{h = (h_1, \ldots, h_n) : h_i = 0 \text{ or } 1\}$.


Regarding f, we assume that it belongs to a location family of distributions; that
is, $f(x, \theta_H) = f(x - \theta_H)$, with a positive dependence structure that ensures that,
for any increasing (or decreasing) function $\phi$ of X, the expectation $E\{\phi(X) \mid X_i, H\}$
is increasing (or decreasing) in $X_i$. This is true if, for instance, X is positive
regression dependent on subset (PRDS) under the density f(x), as in the case of
multivariate normal with positive correlations and many other multivariate
distributions encountered in multiple testing; see, for example, [5, 14]. Of course, when
$(X_i, H_i)$, $i = 1, \ldots, n$, are independent, we assume no particular form for the
density f; that is, we simply assume that $X_i \mid H_i \sim f(x, \theta_{H_i})$. Since we assume that $\theta_i$
takes the value $\theta_i'$ when $H_i = 0$ and the value $\theta_i''$ when $H_i = 1$, the probabilities in
the following discussion are all evaluated under these fixed $\theta' = (\theta_1', \ldots, \theta_n')$ and
$\theta'' = (\theta_1'', \ldots, \theta_n'')$.


THEOREM 5. Under the above mixture model and the conditions assumed
therein,


(6.2)  $\mathrm{FDR}(t, n) \le \sum_{i=1}^{n} \delta_i\, P\{H_i = 0 \mid X_i \ge t\},$

where

(6.3)  $\delta_i = P\{X_i \ge t\} - \sum_{j=1}^{n-1} \frac{P\{X^{(-i)}_{(j)} \ge t,\ X_i \ge t\}}{(n - j)(n - j + 1)}$  and  $\sum_{i=1}^{n} \delta_i = P\{R > 0\},$

with the equality holding when the $(X_i, H_i)$'s are independent.




When $(X_i, H_i)$, $i = 1, \ldots, n$, are identically distributed, Theorem 5 reduces to

(6.4)  $\mathrm{FDR}(t, n) \le P\{H_1 = 0 \mid X_1 \ge t\}\, P\{R > 0\}.$

The equality in (6.4) holds when $(X_i, H_i)$, $i = 1, \ldots, n$, are i.i.d., which is Storey's
[17, 18] result, providing a "Bayesian Type I error rate" interpretation to his notion
of pFDR = FDR/$P\{R > 0\}$. Thus, the following corollary to Theorem 5 is an
extension of his result to the dependent case.


COROLLARY 1. Under the above mixture model and the conditions assumed
therein,


(6.5)  $\mathrm{pFDR}(t, n) \le \max_{1 \le i \le n} P\{H_i = 0 \mid X_i \ge t\}.$


When the $(X_i, H_i)$'s are identically distributed, we have

(6.6)  $\mathrm{pFDR}(t, n) \le P\{H_1 = 0 \mid X_1 \ge t\},$

with the equality holding when the $(X_i, H_i)$'s are i.i.d.
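
In the i.i.d. case the bound in (6.6) is attained, so the pFDR is the posterior probability $P\{H_1 = 0 \mid X_1 \ge t\}$. The sketch below evaluates it for a two-group normal location model; the model, the null proportion `pi0` and the shift `mu` are illustrative assumptions rather than settings from the paper.

```python
from statistics import NormalDist


def prob_null_given_rejection(t: float, pi0: float, mu: float) -> float:
    """P{H_1 = 0 | X_1 >= t} when X_1 ~ N(0, 1) if H_1 = 0 and N(mu, 1) if H_1 = 1."""
    null_tail = 1.0 - NormalDist(0.0, 1.0).cdf(t)   # P{X_1 >= t | H_1 = 0}
    alt_tail = 1.0 - NormalDist(mu, 1.0).cdf(t)     # P{X_1 >= t | H_1 = 1}
    return pi0 * null_tail / (pi0 * null_tail + (1.0 - pi0) * alt_tail)


pi0, mu = 0.8, 2.5
grid = [-3.0 + 0.5 * step for step in range(15)]    # thresholds from -3 to 4
curve = [prob_null_given_rejection(t, pi0, mu) for t in grid]
for t, value in zip(grid, curve):
    print(t, round(value, 4))
```

For a very small threshold every hypothesis is rejected and the value is close to `pi0`; it then decreases as t grows, because the alternative tail dominates the null tail for a positive shift.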


Storey [17] introduced a pFDR analog of the p-value, called the q-value, that
provides a measure of the strength of the tests in a multiple testing procedure with
respect to pFDR. For a single-step multiple testing procedure of n hypotheses with
a rejection region of the form $X_i \ge t$ for each $H_i$, it is defined as


(6.7)  $q_n(t) = \inf_{x \le t} \mathrm{pFDR}(x, n).$


Storey [17], however, considered this quantity when $(X_i, H_i)$, $i = 1, \ldots, n$, are
i.i.d., which is

(6.8)  $q(t, H_1) = \inf_{x \le t} P\{H_1 = 0 \mid X_1 \ge x\}.$


Corollary 1 says that when the $(X_i, H_i)$'s are dependent with common marginals,
in the sense assumed in that corollary, we have $q_n(t) \le q(t, H_1)$. That is, the
q-value of a single-step multiple test procedure obtained under certain commonly
encountered types of dependence is more conservative, as one would want,
compared to the corresponding i.i.d. case.


THEOREM 6. Under the conditions stated in Theorem 5,


(6.9)  $\mathrm{FNR} \le \sum_{i=1}^{n} \gamma_i\, P\{H_i = 1 \mid X_i < t\},$


where

(6.10)  $\gamma_i = P\{X_i < t\} - \sum_{j=1}^{n-1} \frac{P\{X^{(-i)}_{(j)} < t,\ X_i < t\}}{j(j + 1)}$  and  $\sum_{i=1}^{n} \gamma_i = P\{A > 0\},$

with the equality holding when the $(X_i, H_i)$'s are independent.


This theorem can be proved following arguments similar to those used to prove
Theorem 5 and with the help of an identity for $P\{A > 0\}$ given by Sarkar [13].


COROLLARY 2. Under the conditions stated in Theorem 5,

(6.11)  $\mathrm{pFNR} \le \max_{1 \le i \le n} P\{H_i = 1 \mid X_i < t\}.$


When the $(X_i, H_i)$'s are identically distributed, we have

(6.12)  $\mathrm{pFNR} \le P\{H_1 = 1 \mid X_1 < t\},$

with the equality holding when the $(X_i, H_i)$'s are i.i.d.


7. Proofs. 


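PROOF OF LEMMA 1. The FDP is given by


(7.1)  $Q(t; J_0, J_1) = \sum_{i \in J_0} \sum_{j=0}^{n-1} \frac{1}{n - j}\, I\{R = n - j,\ X_i \ge t\}.$


Since $\{R = n - j\} = \{X_{(j)} < t \le X_{(j+1)}\}$ with $X_{(0)} = -\infty$ and $X_{(n+1)} = \infty$, we
have


$\{R = n - j,\ X_i \ge t\} = \{X^{(-i)}_{(j)} < t \le X^{(-i)}_{(j+1)},\ X_i \ge t\},$

where $X^{(-i)}_{(0)} = -\infty$ and $X^{(-i)}_{(n)} = \infty$. Therefore,

(7.2)  $Q(t; J_0, J_1) = \sum_{i \in J_0} \sum_{j=0}^{n-1} \frac{1}{n - j}\, I\{X^{(-i)}_{(j)} < t \le X^{(-i)}_{(j+1)},\ X_i \ge t\}$
       $= \sum_{i \in J_0} \sum_{j=0}^{n-1} \frac{1}{n - j} \bigl[ I\{X^{(-i)}_{(j+1)} \ge t,\ X_i \ge t\} - I\{X^{(-i)}_{(j)} \ge t,\ X_i \ge t\} \bigr]$
       $= \sum_{i \in J_0} I\{X_i \ge t\} - \sum_{i \in J_0} \sum_{j=1}^{n-1} \frac{I\{X^{(-i)}_{(j)} \ge t,\ X_i \ge t\}}{(n - j)(n - j + 1)}.$


Taking the expectation in (7.2), we get the first expression of the FDR in Lemma 1.
The second expression follows from the fact that Q reduces to $I\{R > 0\} =
I\{X_{(n)} \ge t\}$ if we consider the first summation in (7.2) over all $i \in J$. □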
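
The summation-by-parts identity behind the last line of (7.2) can be checked numerically. The sketch below compares the false discovery proportion V/R, computed directly, with $\sum_{i \in J_0} [I\{X_i \ge t\} - \sum_{j=1}^{n-1} I\{X^{(-i)}_{(j)} \ge t, X_i \ge t\}/((n-j)(n-j+1))]$, where $X^{(-i)}_{(j)}$ denotes the j-th smallest of the statistics other than $X_i$. The helper names and the random test data are hypothetical.

```python
import random


def fdp_direct(x, null_idx, t):
    """False discovery proportion V / R, taken to be 0 when R = 0."""
    r = sum(xi >= t for xi in x)
    v = sum(x[i] >= t for i in null_idx)
    return v / r if r > 0 else 0.0


def fdp_identity(x, null_idx, t):
    """Same quantity via the order statistics of the other n - 1 values."""
    n = len(x)
    total = 0.0
    for i in null_idx:
        if x[i] < t:
            continue  # every term carries the factor I(X_i >= t)
        others = sorted(x[:i] + x[i + 1:])  # X^(-i)_(1) <= ... <= X^(-i)_(n-1)
        correction = sum(
            1.0 / ((n - j) * (n - j + 1))
            for j in range(1, n)
            if others[j - 1] >= t
        )
        total += 1.0 - correction
    return total


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(6)]
print(fdp_direct(x, [0, 1, 2], 0.5), fdp_identity(x, [0, 1, 2], 0.5))
```

The two printed values agree up to floating-point rounding for any sample, any index set of true nulls and any threshold, since the identity holds pathwise before expectations are taken.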


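PROOF OF THEOREM 2. First note that

(7.3)  $\mathrm{FDR}^{(2)}_\theta(t_\tau \ge \tau; J_0, J_1) = \sum_{k=0}^{n} E_\theta\bigl[ Q(t_\tau(k); J_0, J_1)\, I\{X_{(k)} < \tau \le X_{(k+1)}\} \bigr].$


Since $t_\tau(k) \ge \tau$ for all k, when $k = n$ (i.e., when $X_{(n)} < \tau$), there is no rejection
of null hypotheses, implying that Q = 0.

Let $F_{\theta_i}(x)$ and $f_{\theta_i}(x)$, respectively, be the c.d.f. and the density of $X_i$ under
any alternative $\theta_i$, for $i = 1, \ldots, n$. Since the $X_i$'s are assumed to be independent,
the conditional expectation of $Q(t_\tau(k); J_0, J_1)$, given $\{X_{(k)} < \tau \le X_{(k+1)}\}$
for $k = 0, 1, \ldots, n - 1$, is the FDR of the single-step procedure based on $n - k$
independent random variables $Y_1, \ldots, Y_{n-k}$ with $Y_i \sim f_{\theta_i}(x) I(x \ge \tau)/[1 - F_{\theta_i}(\tau)]$ and
critical value $t_\tau(k)$. Since the density of $Y_i$ has the MLR property, implying that
$(Y_1, \ldots, Y_{n-k})$ is stochastically increasing, we have from Theorem 1 that this
conditional expectation is


$\le \frac{n_0(\tau)}{n - k}\, P_{\theta_0}\Bigl\{ \max_{1 \le i \le n-k} Y_i \ge t_\tau(k) \Bigr\},$


where $n_0(\tau) = \sum_{i \in J_0} I(X_i \ge \tau)$. Going back to (7.3), we then have

$\mathrm{FDR}^{(2)}_\theta(t_\tau \ge \tau; J_0, J_1)$

(7.4)  $\le E_\theta\Bigl[ \sum_{k=0}^{n-1} \sum_{i \in J_0} \frac{1}{n - k} \Bigl\{ 1 - \Bigl( 1 - \frac{1 - F_0(t_\tau(k))}{1 - F_0(\tau)} \Bigr)^{n-k} \Bigr\}\, I\{X_i \ge \tau,\ X_{(k)} < \tau \le X_{(k+1)}\} \Bigr]$

(7.5)  $= \sum_{i \in J_0} \sum_{k=0}^{n-1} \frac{1}{n - k} \Bigl\{ 1 - \Bigl( 1 - \frac{1 - F_0(t_\tau(k))}{1 - F_0(\tau)} \Bigr)^{n-k} \Bigr\}\, P_\theta\{X_i \ge \tau,\ X^{(-i)}_{(k)} < \tau \le X^{(-i)}_{(k+1)}\},$

which is the required inequality in Theorem 2. □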


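PROOF OF LEMMA 2. The FNP is given by

(7.6)  $N(t; J_0, J_1) = \sum_{i \in J_1} \sum_{j=1}^{n} \frac{1}{j}\, I\{A = j,\ X_i < t\}$
       $= \sum_{i \in J_1} \sum_{j=1}^{n} \frac{1}{j}\, I\{X_{(j)} < t \le X_{(j+1)},\ X_i < t\}$
       $= \sum_{i \in J_1} \sum_{j=1}^{n} \frac{1}{j} \bigl[ I\{X^{(-i)}_{(j-1)} < t,\ X_i < t\} - I\{X^{(-i)}_{(j)} < t,\ X_i < t\} \bigr]$
       $= \sum_{i \in J_1} I\{X_i < t\} - \sum_{i \in J_1} \sum_{j=1}^{n-1} \frac{I\{X^{(-i)}_{(j)} < t,\ X_i < t\}}{j(j + 1)}.$

Taking the expectation of (7.6), we get the first expression of the FNR. The second
expression follows from the fact that

(7.7)  $N(t; J_0, J_1) = I\{A > 0\} - \sum_{i \in J_0} \sum_{j=1}^{n} \frac{1}{j}\, I\{A = j,\ X_i < t\}$
       $= I\{X_{(1)} < t\} - \sum_{i \in J_0} \sum_{j=1}^{n} \frac{1}{j}\, I\{A = j,\ X_i < t\}.$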


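PROOF OF THEOREM 4. We have

(7.8)  $\mathrm{FNR}^{(2)}_\theta(t_\tau \le \tau; J_0, J_1)$
       $= \sum_{k=1}^{n} E_\theta\bigl[ E\{N(t_\tau(k); J_0, J_1) \mid X_{(k)} < \tau \le X_{(k+1)}\}\, I\{X_{(k)} < \tau \le X_{(k+1)}\} \bigr]$
       $\le \sum_{i \in J_1} \sum_{k=1}^{n} \frac{1}{k} \Bigl[ 1 - \Bigl( 1 - \frac{F_0(t_\tau(k))}{F_0(\tau)} \Bigr)^{k} \Bigr]\, P_\theta\{X_i < \tau,\ X^{(-i)}_{(k-1)} < \tau \le X^{(-i)}_{(k)}\}.$


The inequality in (7.8) follows from Theorem 3, noting that the conditional
expectation of $N(t_\tau(k); J_0, J_1)$, given $\{X_{(k)} < \tau \le X_{(k+1)}\}$, is the FNR of the single-step
procedure based on independent $Z_1, \ldots, Z_k$ with $Z_i \sim f_{\theta_i}(x) I(x < \tau)/F_{\theta_i}(\tau)$.
The required inequality in Theorem 4 then follows from (7.8) because $F_{\theta_i}(\tau)$ is
decreasing in $\theta_i$ for $i \in J_1$. □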


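PROOF OF THEOREM 5. Since $V = \sum_{i=1}^{n} I(X_i \ge t) I(H_i = 0)$, we first note
from Lemma 1 that the FDR under the mixture model is given by

(7.9)  $\mathrm{FDR}(t, n) = E_H\Bigl[ \sum_{i=1}^{n} I(H_i = 0) \Bigl\{ P\{X_i \ge t \mid H\} - \sum_{j=1}^{n-1} \frac{P\{X^{(-i)}_{(j)} \ge t,\ X_i \ge t \mid H\}}{(n - j)(n - j + 1)} \Bigr\} \Bigr]$
       $= \sum_{i=1}^{n} \Bigl[ P\{X_i \ge t,\ H_i = 0\} - \sum_{j=1}^{n-1} \frac{P\{X^{(-i)}_{(j)} \ge t,\ X_i \ge t,\ H_i = 0\}}{(n - j)(n - j + 1)} \Bigr]$
       $= \sum_{i=1}^{n} P\{X_i \ge t,\ H_i = 0\} \Bigl[ 1 - \sum_{j=1}^{n-1} \frac{P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 0\}}{(n - j)(n - j + 1)} \Bigr].$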




We now prove that

(7.10)  $P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 0\} \ge P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t\}$

under the assumed positive dependence condition of the density f of X.
Let $\psi(X_i) = P\{X^{(-i)}_{(j)} \ge t \mid X_i,\ \theta_i = 0\}$. Then the conditional probability
$P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ \theta_i\}$ can be written as

(7.11)  $\frac{E\{\psi(X_i) I(X_i \ge t - \theta_i)\}}{E\{I(X_i \ge t - \theta_i)\}},$

with the expectations taken with respect to $X_i$ under $\theta_i = 0$. Note that $\psi(x)$ is an
increasing function of x under the assumed positive dependence condition of f.
Also, $I(x \ge t)$ is a totally positive of order two (TP$_2$) function of (x, t) (see, e.g.,
[11]). Therefore, the ratio

(7.12)  $\frac{E\{\psi(X_i) I(X_i \ge t)\}}{E\{I(X_i \ge t)\}}$

is increasing in t, because it is the expectation of an increasing function of a
random variable whose distribution is stochastically increasing in t. This proves
that $P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 0\} \ge P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 1\}$, implying that
the probability $P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t\}$, being a convex combination of
$P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 0\}$ and $P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 1\}$, is less than or equal to
$P\{X^{(-i)}_{(j)} \ge t \mid X_i \ge t,\ H_i = 0\}$. Thus the required inequality (7.10) follows.

Applying (7.10) to (7.9), we get the inequality (6.2) to be proved in the theorem.
The fact that

(7.13)  $\sum_{i=1}^{n} \delta_i = P\Bigl\{ \max_{1 \le i \le n} X_i \ge t \Bigr\} = P\{R > 0\}$

follows from [13]. Furthermore, it is clear that the equality in (6.2) holds under
independence of $(X_i, H_i)$. Thus, the theorem is proved. □


8. Concluding remarks. We have obtained in this article some theoretical re- 
sults that extend previous work done under the assumption of independent tests. 
Two of these set the stage for developing our idea to modify the FDR- and 
FNR-controlling Bonferroni and Sidák procedures and obtaining wider families 
of FDR- and FNR-controlling procedures. We developed this idea by extending 
inequalities for FDR and FNR under independence from single-step to two-step 
procedures. In the case of the Bonferroni procedures, it is somewhat similar to what
Storey, Taylor and Siegmund [19] used to modify the FDR-controlling BH procedure
(which is, of course, a stepwise procedure) under independence. In the case of
Sidák procedures, however, it is stronger in that we consider modifying less
conservative procedures. It is important to point out that modifying the Sidák
procedure by simply finding t that controls $(\hat{n}_0/n)(1 - F_0^n(t))$ (the estimated maximum
FDR, which is basically the idea in modifying the FDR-controlling Bonferroni
procedure) does not seem to provide much improvement to the Sidák procedure.
The same is true for the FNR-controlling Sidák procedure. This is what we have
noticed based on additional simulations not reported here. Also, as is seen from
Table 2, we need to be cautious using the present modifications when there is too
much dependence in the tests; they may become anticonservative. Procedures that
control FDR are different from those that control FNR. It will be interesting to see
if procedures that control both FDR and FNR can be developed using the results
discussed in this paper.


Acknowledgments. The author thanks an Associate Editor and two referees 
for valuable comments, which have resulted in an improved paper, and Tianhui 
Zhou for her help with the numerical calculations. 


REFERENCES 


[1] BENJAMINI, Y. and HOCHBERG, Y. (1995). Controlling the false discovery rate: A practical
and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300.
MR1325392

[2] BENJAMINI, Y. and HOCHBERG, Y. (2000). On the adaptive control of the false discovery rate
in multiple testing with independent statistics. J. Educational and Behavioral Statistics
25 60-83.

[3] BENJAMINI, Y., KRIEGER, A. M. and YEKUTIELI, D. (2002). Adaptive linear step-up false
discovery rate controlling procedures. Unpublished manuscript.

[4] BENJAMINI, Y. and LIU, W. (1999). A step-down multiple hypotheses testing procedure
that controls the false discovery rate under independence. J. Statist. Plann. Inference 82
163-170. MR1736441

[5] BENJAMINI, Y. and YEKUTIELI, D. (2001). The control of the false discovery rate in multiple
testing under dependence. Ann. Statist. 29 1165-1188. MR1869245

[6] EFRON, B. (2003). Robbins, empirical Bayes and microarrays. Ann. Statist. 31 366-378.
MR1983533

[7] EFRON, B., TIBSHIRANI, R., STOREY, J. D. and TUSHER, V. (2001). Empirical Bayes analysis
of a microarray experiment. J. Amer. Statist. Assoc. 96 1151-1160. MR1946571

[8] GENOVESE, C. and WASSERMAN, L. (2002). Operating characteristics and extensions of
the false discovery rate procedure. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 499-517.
MR1924303

[9] GENOVESE, C. and WASSERMAN, L. (2004). A stochastic process approach to false discovery
control. Ann. Statist. 32 1035-1061. MR2065197

[10] HOCHBERG, Y. and BENJAMINI, Y. (1990). More powerful procedures for multiple
significance testing. Statistics in Medicine 9 811-818.

[11] KARLIN, S. (1968). Total Positivity 1. Stanford Univ. Press. MR0230102

[12] LEHMANN, E. (1986). Testing Statistical Hypotheses, 2nd ed. Wiley, New York. MR0852406

[13] SARKAR, S. K. (1998). Some probability inequalities for ordered MTP2 random variables:
A proof of the Simes conjecture. Ann. Statist. 26 494-504. MR1626047

[14] SARKAR, S. K. (2002). Some results on false discovery rate in stepwise multiple testing
procedures. Ann. Statist. 30 239-257. MR1892663

[15] SARKAR, S. K. (2004). FDR-controlling stepwise procedures and their false negatives rates.
J. Statist. Plann. Inference 125 119-137. MR2086892

[16] SIMES, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance.
Biometrika 73 751-754. MR0897872

[17] STOREY, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat.
Methodol. 64 479-498. MR1924302

[18] STOREY, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the
q-value. Ann. Statist. 31 2013-2035. MR2036398

[19] STOREY, J. D., TAYLOR, J. E. and SIEGMUND, D. (2004). Strong control, conservative point
estimation and simultaneous conservative consistency of false discovery rates: A unified
approach. J. R. Stat. Soc. Ser. B Stat. Methodol. 66 187-205. MR2035766


Fox SCHOOL OF BUSINESS AND MANAGEMENT 
TEMPLE UNIVERSITY 

SPEAKMAN 319 

1810 NORTH 13TH STREET 

PHILADELPHIA, PENNSYLVANIA 19122-6083 
USA 

E-MAIL: sanat@temple.edu


The Annals of Statistics
2006, Vol. 34, No. 1, 416-440
DOI 10.1214/009053605000000732
© Institute of Mathematical Statistics, 2006


POISSON CALCULUS FOR SPATIAL NEUTRAL TO 
THE RIGHT PROCESSES¹


BY LANCELOT F. JAMES 
Hong Kong University of Science and Technology 


Neutral to the right (NTR) processes were introduced by Doksum in 1974
as Bayesian priors on the class of distributions on the real line. Since that time
there have been numerous applications to models that arise in survival analysis
subject to possible right censoring. However, unlike the Dirichlet process,
the larger class of NTR processes has not been used in a wider range of more
complex statistical applications. Here, to circumvent some of these limitations,
we describe a natural extension of NTR processes to arbitrary Polish
spaces, which we call spatial neutral to the right processes. Our construction
also leads to a new rich class of random probability measures, which we call
NTR species sampling models. We show that this class contains the important
two parameter extension of the Dirichlet process. We provide a posterior
analysis, which yields tractable NTR analogues of the Blackwell-MacQueen
distribution. Our analysis turns out to be closely related to the study of
regenerative composition structures. A new computational scheme, which is an
ordered variant of the general Chinese restaurant processes, is developed. This
can be used to approximate complex posterior quantities. We also discuss
some relationships to results that appear outside of Bayesian nonparametrics.


1. Introduction. Doksum [9] considered a nonparametric Bayesian analysis 
based on neutral to the right (NTR) priors. These priors are random probability 
measures defined on the real line, R, that include the popular Dirichlet process 
(see [13] and [16]). Within Bayesian nonparametric statistics, the NTR process 
serves as one of the important classes of models. In particular, there have been 
numerous applications to models that arise in survival analysis subject to possible 
right censoring. On the other hand, unlike the Dirichlet process, the larger class 
of NTR processes has not been used in a wider range of statistical applications. 
That is, for instance, there are no general NTR analogues of the important class of 
(kernel based) Dirichlet process mixture models. See, for example, [32] and [23] 
for further background and references on Dirichlet process mixture models. 

A goal of this article is to begin to answer the question of how one can possi- 
bly use NTR processes in a wider context, as has been the case for the Dirichlet 
process. One of the limitations of NTR processes is that they are only defined on 


Received May 2003; revised March 2005. 
¹Supported in part by RGC Grant HKUST-6159/02P and DAG 01/02.BM43 of HKSAR.
AMS 2000 subject classifications. Primary 62G05; secondary 62F15.
Key words and phrases. Bayesian nonparametrics, inhomogeneous Poisson process, Lévy
processes, neutral to the right processes, regenerative compositions, survival analysis.


SPATIAL NTR PROCESSES 417 


the real line. The other limitation, which is perhaps more severe, is that as of yet we
do not have tractable NTR analogues of the Blackwell-MacQueen [3] Pólya urn
distribution associated with the Dirichlet process. The Blackwell-MacQueen
distribution is well known to be the exchangeable distribution derived from a Dirichlet
process, and its theoretical understanding and practical implementation are crucial
in complex models. To circumvent some of these limitations, we describe a natural
extension of NTR processes defined on an arbitrary Polish space $\mathcal{S} = R^+ \times \mathcal{X}$,
which we call spatial NTR processes. Here $R^+$ denotes the positive real line and
$\mathcal{X}$ is an arbitrary Polish space. Our construction also leads to a rich class of random
probability measures on $\mathcal{X}$, which we call NTR species sampling. We provide
a detailed analysis of these models and obtain properties analogous to the Dirichlet
process. In particular, we provide a description of the posterior distribution of
spatial NTR processes and, more importantly, we give a detailed analysis of the
NTR analogues of the Blackwell-MacQueen distribution.

Such an analysis parallels, in part, the results of Antoniak [1] (see also [12])
and Lo [32] for the Dirichlet process. These works involve characterizations based
on random partitions of the integers $\{1, \ldots, n\}$ and were derived using nontrivial
combinatorial arguments. The structure of general NTR processes is more com- 
plex than that of the Dirichlet process and an approach using direct combinatorial 
analysis is considerably more challenging. We circumvent such issues by apply- 
ing the Poisson process partition calculus discussed by James [24, 26]. This also 
paves the way for a straightforward derivation of the posterior distribution of spa- 
tial NTR processes. Using these results, we develop a new computational scheme 
related to the general Chinese restaurant process (see [37], page 60 and [23]), 
which now allows one to sample from the exchangeable distributions derived from 
NTR processes. 

It is important to note that although Bayesian applications of NTR processes to 
complex statistical models have been limited, the use of these processes appears 
often in other important contexts. Doksum ([9], Theorem 3.1) showed that one can
describe an NTR distribution function F on $R^+$ via positive Lévy processes, Z,
on $R^+$ as


(1)  $1 - F(t) = S(t) = e^{-Z(t)},$


where S denotes the survival distribution of a random variable T from F. The Lévy
process Z is an increasing independent increment process that satisfies $Z(0) = 0$
and $\lim_{t \to \infty} Z(t) = \infty$ a.s. That is, $T \mid F$ has survival distribution $P(T > t \mid F) =
e^{-Z(t)}$. Importantly, the representation in (1) shows that NTR survival processes
essentially coincide with the class of exponential functionals of possibly
inhomogeneous, nonnegative Lévy processes. Such objects and more general exponential
functionals of Lévy processes, such as Brownian motion, have been extensively
studied by probabilists with applications, for instance, to finance. The NTR
models also arise in coalescent theory, which has applications in genetics and physics,


418 L F. JAMES 


as seen, for example, in [36], Proposition 26. See also [2] and [5]. Noting some of 
these connections, Epifani, Lijoi and Prünster [11] applied techniques from those 
manuscripts to obtain expressions for the moments of mean functionals of NTR 
models and, as we also do here, highlighted some of the connections to these areas 
outside of Bayesian nonparametric statistics. The mean functional can be described 
explicitly as 


(2)  $I = \int_0^\infty t\, F(dt) = \int_0^\infty S(t)\, dt = \int_0^\infty e^{-Z(t)}\, dt.$


It is a significant object, which has interesting interpretations in a variety of fields. 
We describe how this process is related to the study of the Blackwell-MacQueen 
analogue derived from NTR processes. Moreover, we discuss how our work is 
closely related to the recent work of Gnedin and Pitman [18] on regenerative com- 
position structures. 


2. Spatial neutral to the right processes. Suppose that (T, X) are random elements
on the Polish space $\mathcal{S}$ that have distribution F(ds, dx) for $(s, x) \in \mathcal{S}$. Here
we would like to extend the definition of an NTR process to model F(ds, dx) as
a random probability measure such that its marginal $F(ds, \mathcal{X})$ is an NTR process.
While the representation in (1) is quite useful for calculations, it is not immediately
obvious how one can use this definition to extend an NTR process to $\mathcal{S}$. The
known exception is the Dirichlet process that can be defined on arbitrary spaces.
To do this, we first recall that if F is an NTR process on $R^+$, then its cumulative
hazard $\Lambda(ds) := F(ds)/S(s-)$ is a nonnegative Lévy process; in other words, $\Lambda$ is
a completely random measure (see [30]). This observation and alternative idea for
modeling via cumulative hazards is due to the important work of Hjort [21]. Note
that Z in (1) is also a completely random measure. Moreover, an important aspect
of our results relies on the fact that there is a one-to-one distributional correspondence
between a particular Z and $\Lambda$. Specifically, if J represents a random jump
of $\Lambda$ taking its values in [0, 1], then $-\log(1 - J)$ is the jump of a corresponding
Z taking its values in $R^+$. Hence if we initially model Z and $\Lambda$ as completely
random measures without a drift component and fixed points of discontinuity, they
may both be represented as linear functionals of a common Poisson random measure.
Importantly, one then may give precise meaning to the distributional equivalences
$P(T \in ds \mid F) = F(ds) := S(s-)\Lambda(ds) := e^{-Z(s-)}\Lambda(ds)$, where F is an
NTR process.

Our construction now proceeds by extending $\Lambda$ and Z to completely random
measures on $\mathcal{S}$, using a representation in terms of a Poisson random measure. Let
N denote a Poisson random measure on some Polish space $\mathcal{W} = [0, 1] \times \mathcal{S}$ with
mean intensity


$E[N(du, ds, dx) \mid \nu] = \nu(du, ds, dx) := \rho(du \mid s)\, \Lambda_0(ds, dx).$


Here $\rho$ is a Lévy density that will determine the conditional distribution of the
jumps of $\Lambda$ and Z. Furthermore, without loss of generality, we assume that


$\int_0^1 u \rho(du \mid s) = 1$ and, hence, $u \rho(du \mid s)$ is a probability density function. The
intensity $\nu$ is chosen such that $\Lambda_0(ds, dx) := F_0(ds, dx)/S_0(s-)$ is by definition a
hazard measure on $\mathcal{S}$, where $F_0$ represents a prior specification for the distribution
of F on $\mathcal{S}$, and $S_0$ is the corresponding survival function on $R^+$. See [31],
A5.3, for formal details of hazard measures on abstract spaces. Note that in [28],
Proposition 25.28, the hazard measure is also called a natural compensator of a
random measure defined as $\delta_{T,X}$. We denote the Poisson law of N with intensity
$\nu$ as $P(dN \mid \nu)$. The Laplace functional for N, which plays an important role in our
analysis, is defined as


$E[e^{-N(f)} \mid \nu] = \int_{\mathcal{M}} e^{-N(f)}\, P(dN \mid \nu) = e^{-\phi(f)},$


where for any positive f, $N(f) = \int_{\mathcal{W}} f(x) N(dx)$ and $\phi(f) = \int_{\mathcal{W}} (1 - e^{-f(x)})\, \nu(dx)$,
and $\mathcal{M}$ denotes the space of boundedly finite measures on $\mathcal{W}$ (see [6]).
A measure, say N, is boundedly finite if for each bounded set A, $N(A) < \infty$. See
also [28], Chapter 12, for a discussion of Poisson random measures and the unicity
property of Laplace functionals.

Now the specifications above imply that $\Lambda(ds, dx) := \int_0^1 u N(du, ds, dx)$ is a
completely random hazard measure on $\mathcal{S}$ with mean $E[\Lambda(ds, dx)] = \Lambda_0(ds, dx)$
and there is a corresponding $Z(ds, dx) := \int_0^1 [-\log(1 - u)] N(du, ds, dx)$. In
particular, $-\log S(t-) := Z(t-) = \int_{\mathcal{W}} [-I\{s < t\} \log(1 - u)] N(du, ds, dx)$. Now
using these facts we define a spatial neutral to the right (SPNTR) random probability
measure on $\mathcal{S}$ as


(3)  $P(T \in dt, X \in dx \mid F) := F(dt, dx) = S(t-)\Lambda(dt, dx).$


Defining $\Lambda(ds) := \Lambda(ds, \mathcal{X})$, it follows that $F(ds) := S(s-)\Lambda(ds)$ is an NTR
process and, furthermore, $E[F(dt, dx)] = S_0(t-)\Lambda_0(dt, dx) = F_0(dt, dx)$. See
Section 5 for more details.
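
The definition in (3) is easy to trace on a finite set of atoms. The sketch below takes hypothetical points (u, s, x) of N, where u is the jump of $\Lambda$ at time s with mark x, and returns the mass $S(s-)u$ that F assigns to each (s, x), using $S(s-) = \prod_{s_j < s}(1 - u_j)$. The atom values are made up, distinct times are assumed, and a genuine SPNTR process would typically have infinitely many atoms.

```python
import math


def spntr_atom_masses(atoms):
    """Return ([(s, x, mass)], leftover survival) for atoms given as (u, s, x)."""
    ordered = sorted(atoms, key=lambda atom: atom[1])  # order by time s
    masses = []
    survival = 1.0  # S(s-) just before the earliest atom
    for u, s, x in ordered:
        masses.append((s, x, survival * u))  # F({(s, x)}) = S(s-) * Lambda({(s, x)})
        survival *= 1.0 - u                  # S(s) = S(s-) * (1 - u)
    return masses, survival


atoms = [(0.30, 1.2, "a"), (0.50, 0.4, "b"), (0.20, 2.5, "c")]  # hypothetical points
masses, tail = spntr_atom_masses(atoms)
z_total = sum(-math.log(1.0 - u) for u, _, _ in atoms)  # total mass of Z
print(masses, tail, math.exp(-z_total))
```

The masses and the leftover survival sum to one, and the leftover equals $e^{-Z}$ evaluated over all atoms; with finitely many atoms this leftover is positive, which is why the requirement $\lim_{t \to \infty} Z(t) = \infty$ is needed for F to be a proper probability measure.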


REMARK 1. The choice of

(4)  $\rho(du \mid s)\, \Lambda_0(ds, dx) = c(s)\, u^{-1} (1 - u)^{c(s) - 1}\, du\, \Lambda_0(ds, dx)$


for c(s) a positive function yields a natural extension of Hjort's [21] beta cumulative
hazard process to beta processes on $\mathcal{S}$. Equivalently, this specification defines
beta-Stacy or beta-neutral distribution functions on $\mathcal{S}$. See [33, 41] and [21],
Section 7A, for such processes defined on $R^+$. The case of the Dirichlet process with
shape parameter $\theta F_0$ is obtained by choosing $c(s) = \theta S_0(s-)$. Our construction
of spatial NTR processes is influenced by the work of James and Kwon [27], who
first gave an explicit construction of spatial beta-neutral processes on $\mathcal{S}$ via ratios
of two independent gamma processes.




REMARK 2. Given the specifications in (3), we extend this definition to include
prior fixed points of discontinuity $\{(s_1, w_1), \ldots, (s_k, w_k)\}$ in $\mathcal{S}$ as


(5)  $F_k(ds, dx) = S(s-) \Bigl[ \prod_{\{j : s_j < s\}} (1 - U_j) \Bigr] \Lambda_k(ds, dx),$


where $\Lambda_k(ds, dx) = \Lambda(ds, dx) + \sum_{j=1}^{k} U_j \delta_{s_j, w_j}(ds, dx)$ is defined such that,
independent of $\Lambda$, the $U_j$ are independent random variables on [0, 1] with distribution $H_j$
for $j = 1, \ldots, k$. We call $F_k$ a general spatial NTR process.


REMARK 3. The log mapping that we use can be deduced, for instance, from [7] and [8], Proposition 2. This type of correspondence is actually noted, albeit less explicitly, in [21] and is also used in related contexts without specific mention of NTR processes; see, for instance, [36], Proposition 26. In particular, if $\tau$ is a Lévy measure that specifies the conditional distribution of the jumps of $Z$, then by writing $\tau(dy\mid s) := \tau(y\mid s)\,dy$ and $\rho(du\mid s) := \rho(u\mid s)\,du$, the relationship between the Lévy measures of $Z$ and $\Lambda$ is described by

$$\tau(y\mid s) = e^{-y}\rho(1 - e^{-y}\mid s) \quad\text{for } y \in \mathbb{R}^+ \qquad\text{or}\qquad \rho(u\mid s) = (1-u)^{-1}\tau(-\log(1-u)\mid s) \quad\text{for } u \in [0, 1].$$

Note that if $\rho(du\mid s) := \rho(du)$, then we say that the relevant processes are homogeneous.


3. Posterior analysis. Similar to the case of the Dirichlet process, we consider the following setup. Suppose that $(T_i, X_i)\mid F$ are i.i.d. pairs with common distribution $F$ for $i = 1, \ldots, n$ and suppose the law of $F$ is modeled as a spatial NTR process. This description yields a joint distribution of $(T, X) = \{(T_1, X_1), \ldots, (T_n, X_n)\}$ and $F$. We are interested in the Bayesian disintegration of this joint distribution in terms of the posterior distribution of $F\mid T, X$ and the marginal distribution of $(T, X)$. Since $\Lambda$, $Z$ and $F$ are all functionals of $N$, we work instead with the joint distribution of $(T, X, N)$,

$$\Bigl[\prod_{i=1}^{n} S(T_i-)\,\Lambda(dT_i, dX_i)\Bigr] P(dN\mid\nu) = \pi(dN\mid T, X)\,\mathcal{M}(dT_1, dX_1, \ldots, dT_n, dX_n), \tag{6}$$

where $\pi(dN\mid T, X)$ denotes the desired posterior distribution of $N\mid T, X$ and

$$\mathcal{M}(dT, dX) = \mathcal{M}(dT_1, dX_1, \ldots, dT_n, dX_n) = \int_{\mathbb{M}} \Bigl[\prod_{i=1}^{n} S(T_i-)\,\Lambda(dT_i, dX_i)\Bigr] P(dN\mid\nu) \tag{7}$$




is the important exchangeable marginal distribution of (T, X). The M denotes the 
general analogue of the Blackwell-MacQueen Pólya urn, and hence is crucial to 
both theoretical understanding and practical implementation of the general class 
of spatial NTR processes. We will describe the posterior distribution given (T, X) 
in Section 4, and give a detailed analysis of M and related quantities in Section 5. 
We first explain some key elements of the analysis. 


3.1. The role of random partitions and order statistics. It is clear that one can always represent $(T, X)$ as $(T^*, X^*, p)$, where, using notation similar to Lo [32], $(T^*, X^*) = \{(T_1^*, X_1^*), \ldots, (T_{n(p)}^*, X_{n(p)}^*)\}$ denotes the distinct pairs of observations within the sample and where $p = \{E_1, \ldots, E_{n(p)}\}$ stands for a partition of $\{1, \ldots, n\}$ of size $n(p) \le n$ that records which observations within the sample are equal. The number of elements in the $j$th cell, $E_j := \{i : (T_i, X_i) = (T_j^*, X_j^*)\}$, of the partition is indicated by $e_j$ for $j = 1, \ldots, n(p)$, so that $\sum_{j=1}^{n(p)} e_j = n$. It follows that the marginal distribution of $(T, X)$, say $\mathcal{M}$, can be expressed in terms of a conditional distribution of $T, X\mid p$, which is the same as a conditional distribution of the unique values $T^*, X^*\mid p$, and the marginal distribution of $p$. The marginal distribution of $p$, denoted as $\pi(p)$ or $p(e_1, \ldots, e_{n(p)})$, is an exchangeable partition probability function (EPPF), that is, a probability distribution on $p$ which is exchangeable in its arguments and depends only on the size of each cell. The best known case of an EPPF is the variant of the Ewens sampling formula (ESF) (see [1, 12]) associated with the Dirichlet process with total mass $\theta$, given as

$$\pi(p) = \frac{\theta^{n(p)}\,\Gamma(\theta)}{\Gamma(\theta + n)} \prod_{j=1}^{n(p)} \Gamma(e_j).$$
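As a small numerical illustration of the Ewens-type EPPF above, the following sketch evaluates the formula and checks that the probabilities of all partitions of a four-element set add up to one. The function names (`ewens_eppf`, `set_partitions`) and the value of `theta` are illustrative choices, not notation from the paper.

```python
from math import exp, lgamma, log


def ewens_eppf(cell_sizes, theta):
    """theta^{n(p)} Gamma(theta) / Gamma(theta + n) * prod_j Gamma(e_j)."""
    n = sum(cell_sizes)
    log_p = len(cell_sizes) * log(theta) + lgamma(theta) - lgamma(theta + n)
    log_p += sum(lgamma(e) for e in cell_sizes)
    return exp(log_p)


def set_partitions(items):
    """Yield every partition of a list, each as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        # ... or let it open a new block.
        yield [[first]] + smaller


theta = 1.5  # hypothetical total mass
total = sum(
    ewens_eppf([len(block) for block in partition], theta)
    for partition in set_partitions([1, 2, 3, 4])
)
print(total)  # sum of the EPPF over all partitions of {1, 2, 3, 4}
```

Only the cell sizes enter `ewens_eppf`, which is the sense in which the EPPF "depends only on the size of each cell".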


Additionally, since a Dirichlet process is a special case of what are called species sampling models, the distribution of $T, X\mid p$ is such that the unique pairs $(T_j^*, X_j^*)$ are i.i.d. with distribution $F_0$. We note that the marginal distribution and, naturally, the posterior distribution of the Dirichlet process depend only on the counts $e_j$ and the unique values. The structure of $\mathcal{M}$ for general NTR processes is considerably more complex. However, as we will explain, what is interesting is that it does have a natural interpretation in terms of classical survival models. One can think of $T^*$ as the collection of the unordered distinct times to death of individuals in a sample of size $n$. In this sense, the count $e_j$ represents the number of deaths at time $T_j^*$. Additionally, it is well known that the posterior distribution of NTR processes also depends on the number at risk at a given time, say $t$, which can be defined as $Y_n(t) = \sum_{i=1}^{n} I\{T_i \ge t\}$. We have discovered that to simplify the expressions for $\mathcal{M}$, it is necessary not only to know the number of deaths, but also to know the number at risk at the unique times. Instead of working directly with $T^*$, we do this by using its ordered values.

That is, let $T_{(1:n)} > T_{(2:n)} > \cdots > T_{(n(p):n)}$ denote an ordering of the unique values $\{T_1^*, \ldots, T_{n(p)}^*\}$. This collection represents the ordered unique times of death. Note that we work with the pairs $(T_{(j:n)}, X_j^*)$, where $X_j^*$ is simply the unique value treated as the concomitant of $T_{(j:n)}$. That is, we do not order the $X$ values; in fact, some spaces $\mathcal{X}$ do not have a natural ordering. Associated with this, let $m = \{E_{(1)}, \ldots, E_{(n(p))}\}$ denote the collection of sets $E_{(j)} = \{i : T_i = T_{(j:n)}\}$ for $j = 1, \ldots, n(p)$. That is, $E_{(j)}$ is the collection of indices of the observations equal to the $j$th largest unique death time. Similar to $e_j$, let $m_j = |E_{(j)}|$ denote the number of deaths at the $j$th largest unique death time, $T_{(j:n)}$, for $j = 1, \ldots, n(p)$. There are of course $n(p)!$ possible orderings of $T^*$. This implies that given a partition $p = \{E_1, \ldots, E_{n(p)}\}$, the collection $(m_1, \ldots, m_{n(p)})$ [resp. $(m)$] takes its values over the symmetric group, say $S_{n(p)}$, of all $n(p)!$ permutations of $(e_1, \ldots, e_{n(p)})$ [of $\{E_1, \ldots, E_{n(p)}\}$]. Notice now that, for each $s$,

$$Y_n(s) = \sum_{i=1}^{n} I\{T_i \ge s\} = \sum_{j=1}^{n(p)} e_j\,I\{T_j^* \ge s\} = \sum_{j=1}^{n(p)} m_j\,I\{T_{(j:n)} \ge s\}.$$

Hence for $j = 1, \ldots, n(p)$, we can define $r_j := Y_n(T_{(j:n)}) = \sum_{i=1}^{j} m_i$, which denotes the number of observations at least as large as the $j$th largest unique value. Note that $r_0 = 0$ and $r_{n(p)} = n$; additionally, $r_j = r_{j-1} + m_j$. What is important is that, unlike $p$, the collection $\{E_{(1)}, \ldots, E_{(n(p))}\}$ completely determines $(r_j)$ via the $(m_j)$; that is, $m$ contains the relevant information in $p$. We will often refer to $(m, p)$ rather than $m$ to remind the reader of the dependence of $m$ on $p$.
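The bookkeeping just described can be sketched in a few lines. The helper names and the toy sample below are hypothetical; the code only mirrors the definitions of $T_{(j:n)}$, $m_j$, $r_j$ and $Y_n$ given above.

```python
def ordered_death_counts(times):
    """Return (unique times, largest first), the death counts m_j and the at-risk counts r_j."""
    unique_desc = sorted(set(times), reverse=True)
    m = [sum(1 for t in times if t == u) for u in unique_desc]
    r, running = [], 0
    for m_j in m:
        running += m_j          # r_j = r_{j-1} + m_j, with r_0 = 0
        r.append(running)
    return unique_desc, m, r


def at_risk(times, s):
    """Y_n(s): number of observations with T_i >= s."""
    return sum(1 for t in times if t >= s)


times = [2.0, 5.0, 2.0, 7.0, 5.0, 2.0]   # toy sample with ties
unique_desc, m, r = ordered_death_counts(times)
print(unique_desc, m, r)
```

By construction `r[j]` agrees with `at_risk(times, unique_desc[j])`, which is the identity $r_j = Y_n(T_{(j:n)})$.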


REMARK 4. See [34, 37, 38] for a general overview of the EPPF concept and 
see [23, 24, 26] for its relevance to general marginal exchangeable distributions 
that arise in a Bayesian context. 


REMARK 5. One of the earliest applications of the Ewens sampling formula is 
in population genetics. It is quite interesting to note that, as described by Donnelly 
and Joyce ([10], page 230), one may also interpret the $(T_{(j:n)})$ as the ordering of
genetic types (alleles) of individuals, where new alleles arise by mutation and the 
alleles present in the population or in a sample at a given time may be ordered by 
age. An interesting by-product of our work is that it actually yields the explicit 
distribution for large classes of such models. A simple description will be given in 
Proposition 5.2. 


4. The posterior distribution of spatial NTR processes. In this section we 
describe formally the posterior distribution of spatial NTR processes given the data 
(T, X). Note here that we will characterize the posterior via the ordered values 
rather than T*. Since we are conditioning on (T, X), these are equivalent notions. 
We first describe the result for no fixed points of discontinuity and then discuss 
how one easily obtains the extension in Section 4.1. The proof is delayed until the 
Appendix. 




PROPOSITION 4.1. Let $F$ be a spatial NTR process defined by the Poisson random measure $N$ with mean intensity $\nu(du, ds, dx) = \rho(du\mid s)\Lambda_0(ds, dx)$; $\Lambda$ is its corresponding Lévy hazard measure. Suppose that $(T_j, X_j)\mid F$ are i.i.d. $F$ for $j = 1, \ldots, n$. Then:

(i) The posterior distribution of $N\mid T, X$ is equivalent to the distribution of the random measure $N_n^* = N_n + \sum_{j=1}^{n(p)} \delta_{J_{j,n}, T_{(j:n)}, X_j^*}$, where, conditional on $(T, X)$, $N_n$ is a Poisson random measure with intensity

$$\nu_n(du, ds, dx) = (1-u)^{Y_n(s)}\rho(du\mid s)\Lambda_0(ds, dx). \tag{8}$$

Additionally, the $(J_{j,n})$ are conditionally independent of $N_n$ and are mutually independent with distributions specified by

$$P(J_{j,n} \in du \mid T_{(j:n)}) \propto u^{m_j}(1-u)^{r_{j-1}}\rho(du\mid T_{(j:n)})$$

for $j = 1, \ldots, n(p)$.

(ii) The posterior distribution of $\Lambda$ given $(T, X)$ is equivalent to the law of the Lévy hazard measure,

$$\Lambda_n^*(ds, dx) = \int_0^1 u\,N_n^*(du, ds, dx) = \Lambda_n(ds, dx) + \sum_{j=1}^{n(p)} J_{j,n}\,\delta_{T_{(j:n)}, X_j^*}(ds, dx),$$

where $\Lambda_n(ds, dx) = \int_0^1 u\,N_n(du, ds, dx)$ is a Lévy hazard measure with Lévy measure as in (8) and where the $(J_{j,n})$ are conditionally independent of $\Lambda_n$.

(iii) The posterior distribution of the corresponding $Z$ process is equivalent to the law of the random measure

$$Z_n^*(ds, dx) = Z_n(ds, dx) + \sum_{j=1}^{n(p)} Z_{j,n}\,\delta_{T_{(j:n)}, X_j^*}(ds, dx),$$

where $Z_n(ds, dx) = \int_0^1 [-\log(1-u)]\,N_n(du, ds, dx)$ and each $Z_{j,n} = -\log(1 - J_{j,n})$ with distribution

$$P(Z_{j,n} \in dy \mid T_{(j:n)}) = P(J_{j,n} \in d(1 - e^{-y}) \mid T_{(j:n)}) \propto (1 - e^{-y})^{m_j} e^{-r_{j-1} y}\,\tau(dy\mid T_{(j:n)}).$$

(iv) Additionally, the posterior distribution of $F$ is equivalent to the conditional law, given $(T, X)$, of the random probability measure $F_n^*(ds, dx)$ expressed as

$$S_n(s-)\Bigl[\prod_{\{j:\, T_{(j:n)} < s\}} (1 - J_{j,n})\Bigr]\Lambda_n(ds, dx) + \sum_{j=1}^{n(p)} P_{j,n}\,\delta_{T_{(j:n)}, X_j^*}(ds, dx),$$

where $P_{j,n} = S_n(T_{(j:n)}-)\,J_{j,n}\prod_{l=j+1}^{n(p)} (1 - J_{l,n})$. It follows that the Bayesian prediction rule is given by $E[F_n^*(ds, dx)\mid T, X]$, which can be expressed in several ways.
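To make part (i) concrete, consider the beta-process specification (4) with a constant $c(s) = c$ (an assumption made here only for illustration). Then $u^{m_j}(1-u)^{r_{j-1}}\rho(du)$ is proportional to $u^{m_j - 1}(1-u)^{c + r_{j-1} - 1}\,du$, so each jump $J_{j,n}$ is Beta$(m_j, c + r_{j-1})$. The sketch below, with hypothetical function names, computes these parameters and draws the jumps.

```python
import random


def beta_process_jump_params(m, c):
    """Beta parameters (m_j, c + r_{j-1}) of the posterior jumps J_{j,n}.

    `m` lists the death counts m_j with the largest unique death time first;
    `c` is the constant c(s) = c of the beta-process Levy measure (assumed).
    """
    params, r_prev = [], 0
    for m_j in m:
        params.append((m_j, c + r_prev))
        r_prev += m_j
    return params


def sample_jumps(m, c, rng):
    """Draw the mutually independent jumps J_{1,n}, ..., J_{n(p),n}."""
    return [rng.betavariate(a, b) for a, b in beta_process_jump_params(m, c)]


rng = random.Random(0)
print(beta_process_jump_params([1, 2, 3], c=2.0))
print(sample_jumps([1, 2, 3], c=2.0, rng=rng))
```

The Poisson part $N_n$ of the posterior is not simulated here; approximating it requires separate methods for completely random measures.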




REMARK 6. Note that due to symmetry, one has the equivalence in distribution of

$$\sum_{j=1}^{n(p)} J_{j,n}\,\delta_{T_{(j:n)}, X_j^*}(ds, dx) \stackrel{d}{=} \sum_{j=1}^{n(p)} J_{j,n}^*\,\delta_{T_j^*, X_j^*}(ds, dx),$$

where the random variables $(J_{j,n}^*)$ are mutually independent with marginal distributions $P(J_{j,n}^* \in du \mid T_j^*) \propto u^{e_j}(1-u)^{Y_n(T_j^*) - e_j}\rho(du\mid T_j^*)$. Recall that $Y_n(T_{(j:n)}) = r_j$.


4.1. Remarks on prior fixed points of discontinuity. We have so far omitted any discussion on the form of the posterior distribution when there are prior points of discontinuity as in $\Lambda_k$ defined in (5). In fact, the analysis is essentially already contained in our results. Recall that for $n \ge 1$ the posterior process for $\Lambda$ in the complete data is $\Lambda_n^* = \Lambda_n + \sum_{j=1}^{n(p)} J_{j,n}\,\delta_{T_{(j:n)}, X_j^*}$, where the $(J_{j,n})$ are conditionally independent of $\Lambda_n$. Using the fact that $\Lambda_k$ and $\Lambda_n^*$ are the same structurally, one can simply let $n(p)$ play the role of $k$ and let $\{U_j, s_j, w_j\}$ play the role of $(J_{j,n}, T_{(j:n)}, X_j^*)$. Let $n_l = |\{i : (T_i, X_i) = (s_l, w_l)\}|$ for $l = 1, \ldots, k$. In addition, let $(T_{(j:n)}, X_j^*)$ denote the $0 \le n(p) \le n$ unique values distinct from $\{(s_1, w_1), \ldots, (s_k, w_k)\}$. Then it is easy to see that the posterior distribution of $\Lambda_k$ is of the form

$$\Lambda_{n,k}^* = \Lambda_n + \sum_{l=1}^{k} U_{l,n}\,\delta_{s_l, w_l} + \sum_{j=1}^{n(p)} J_{j,n}\,\delta_{T_{(j:n)}, X_j^*},$$

where $P\{U_{l,n} \in du \mid s_l\} \propto u^{n_l}(1-u)^{Y_n(s_l) - n_l} H_l(du)$ for $l = 1, \ldots, k$. Note here we use $Y_n(s) = \sum_{i=1}^{n} I\{T_i \ge s\}$.


REMARK 7. Note that marginalizing over X, the result in Proposition 4.1 re- 
duces to the appropriate analogous results for NTR processes described in [9, 14, 
15, 21, 29]. However, we shall present a considerably streamlined and direct proof 
that uses a methodology applicable to a much wider class of random probability 
measures on abstract spaces. Note moreover that there is no analogue of Proposition 4.1(i) appearing in those works. The distribution of $F_n^*(\infty, dx)$ corresponds
to the posterior distribution of a new class of random probability measures, which 
we discuss in more detail in Section 5.3. 


5. Analysis of NTR generalizations of the Blackwell-MacQueen distrib- 
ution. We now present a detailed analysis of the marginal distribution M and 
related quantities. We give details for Lemmas 5.1 and 5.2 in the Appendix. First 




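we introduce some additional notation. For a homogeneous $\rho$ or $\tau$ and for $\omega > 0$, let

$$\phi(\omega) = \int_0^\infty (1 - e^{-\omega y})\,\tau(dy) = \int_0^1 \bigl(1 - (1-u)^{\omega}\bigr)\rho(du) = \int_0^1 \omega(1-u)^{\omega - 1}\Bigl[\int_u^1 \rho(dv)\Bigr]du.$$

This is the Lévy exponent defined by the Laplace transform of a homogeneous $Z$ process. For integers $(i, k)$, let

$$\psi_{i,k}(s) = \int_0^\infty (1 - e^{-iy})e^{-ky}\,\tau(dy\mid s) = \int_0^1 \bigl(1 - (1-u)^{i}\bigr)(1-u)^{k}\rho(du\mid s).$$

In the homogeneous case, set $\psi_{i,k} = \int_0^\infty (1 - e^{-iy})e^{-ky}\,\tau(dy)$ and note that for each $j$, $\phi(j) = \psi_{j,0} = \int_0^\infty (1 - e^{-jy})\,\tau(dy)$. Finally, we define cumulants

$$\kappa_{m_j, r_{j-1}}(\rho\mid s) = \int_0^1 u^{m_j}(1-u)^{r_{j-1}}\rho(du\mid s)$$

and

$$\kappa_{m_j, r_{j-1}}(\rho) = \int_0^1 u^{m_j}(1-u)^{r_{j-1}}\rho(du).$$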
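The three quantities just defined are ordinary one-dimensional integrals against $\rho$, so they are easy to check numerically. The sketch below takes, purely as an illustration, the homogeneous choice $\rho(du) = \theta(\theta+1)(1-u)^{\theta-1}du$ that appears later in Section 6.2.1, with a hypothetical value of $\theta$, and evaluates $\phi$, $\psi_{i,k}$ and $\kappa_{m,r}$ with a midpoint rule. It also verifies the additive relation $\phi(r_{j-1}) + \psi_{m_j, r_{j-1}} = \phi(r_j)$ that is used in the proof of Proposition 5.1.

```python
THETA = 2.0  # hypothetical parameter value


def rho_density(u, theta=THETA):
    """Density of rho(du) = theta (theta + 1) (1 - u)^(theta - 1) du on (0, 1)."""
    return theta * (theta + 1.0) * (1.0 - u) ** (theta - 1.0)


def integrate01(f, steps=4000):
    """Midpoint rule on (0, 1); adequate for the smooth integrands used here."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


def phi(w):
    return integrate01(lambda u: (1.0 - (1.0 - u) ** w) * rho_density(u))


def psi(i, k):
    return integrate01(lambda u: (1.0 - (1.0 - u) ** i) * (1.0 - u) ** k * rho_density(u))


def kappa(m, r):
    return integrate01(lambda u: u ** m * (1.0 - u) ** r * rho_density(u))


# With m_j = 2 and r_{j-1} = 1 one has r_j = 3.
print(phi(1) + psi(2, 1), phi(3), kappa(2, 1))
```

For this particular $\rho$ the integrals have closed forms, for instance $\phi(r) = (\theta+1)r/(\theta+r)$, which gives an independent check on the quadrature.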


Our first task will be to obtain a nice expression for the expectation of the product of survival functions that appears in (6). First notice that

$$\prod_{i=1}^{n} S(T_i-) = \prod_{j=1}^{n(p)} [S(T_j^*-)]^{e_j} = \prod_{j=1}^{n(p)} [S(T_{(j:n)}-)]^{m_j}.$$

These equivalences lead to the following result.


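LEMMA 5.1. Let $\nu(du, ds, dx) = \rho(du\mid s)\Lambda_0(ds, dx)$ be the mean intensity of a Poisson random measure $N$. Then

$$E\Bigl[\prod_{j=1}^{n(p)} [S(T_{(j:n)}-)]^{m_j}\Bigr] = \prod_{j=1}^{n(p)} e^{-\int_0^{T_{(j:n)}} \psi_{m_j, r_{j-1}}(s)\,\Lambda_0(ds)}.$$

The expression reduces to $\prod_{j=1}^{n} \exp\bigl(-\int_0^{T_{(j:n)}} \psi_{1, j-1}(s)\,\Lambda_0(ds)\bigr)$ when there are no ties.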


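Lemma 5.1 is instrumental in obtaining the following initial description of $\mathcal{M}$.

LEMMA 5.2. Let $\mathcal{M}$ denote the exchangeable distribution of $(T, X)$ defined in (7). Then $\mathcal{M}(dT, dX)$ can be expressed as

$$\Bigl[\prod_{j=1}^{n(p)} e^{-\int_0^{T_{(j:n)}} \psi_{m_j, r_{j-1}}(s)\,\Lambda_0(ds)}\,\kappa_{m_j, r_{j-1}}(\rho\mid T_{(j:n)})\Bigr] \prod_{l=1}^{n(p)} \Lambda_0(dT_l^*, dX_l^*).$$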




We now show how one can obtain calculations using $\mathcal{M}$. For each $m \in S_{n(p)}$ and integrable function $g(T)$, define

$$L(g; m) = \int_0^\infty \int_{t_{n(p)}}^\infty \cdots \int_{t_2}^\infty g((t, m)) \prod_{j=1}^{n(p)} e^{-\int_0^{t_j} \psi_{m_j, r_{j-1}}(s)\,\Lambda_0(ds)}\,\kappa_{m_j, r_{j-1}}(\rho\mid t_j)\,\Lambda_0(dt_j),$$

where $t_1 > t_2 > \cdots > t_{n(p)}$ denotes one of $n(p)!$ orderings of the unique values. With some abuse of notation, the vector $(t, m) = (t)$ denotes the collection of $n$ points whose $n(p)$ unique values are ordered according to $m$. For example, suppose one has the function $g(T_1, T_2, T_3)$. Then in the instance where $T_1 = T_2 < T_3$, one has $n(p) = 2$ unique values and one evaluates $g(T_{(2:3)}, T_{(2:3)}, T_{(1:3)})$ or, using the notation above, $g(t_2, t_2, t_1)$.

We now use $L$ to obtain very general formulae for expected values of complex integrals of NTR processes. This plays a key role in obtaining the EPPF $\pi(p)$ and related quantities.


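LEMMA 5.3. Assume that the random functional $\mathcal{L}(g) = \int g(t) \prod_{i=1}^{n} F(dt_i)$ is integrable, where $F$ is an NTR process specified by the Poisson law $P(dN\mid\nu)$. Then it follows from Lemma 5.2 that

$$E[\mathcal{L}(g)\mid\nu] = \sum_{p}\Bigl[\sum_{m \in S_{n(p)}} L(g; m)\Bigr].$$

In the homogeneous case, $\rho(du\mid s) = \rho(du)$, the expression reduces to

$$\sum_{p}\sum_{m \in S_{n(p)}} \Bigl[\prod_{j=1}^{n(p)} \kappa_{m_j, r_{j-1}}(\rho)\Bigr] \int_0^\infty \int_{t_{n(p)}}^\infty \cdots \int_{t_2}^\infty g((t, m)) \prod_{j=1}^{n(p)} e^{-\psi_{m_j, r_{j-1}}\Lambda_0(t_j)}\,\Lambda_0(dt_j).$$

PROOF. The result follows from an application of Fubini's theorem and Lemma 5.2, which yields

$$\int_{\mathbb{M}} \mathcal{L}(g)\,P(dN\mid\nu) = \int g(t)\,\mathcal{M}(dt, dx). \qquad \square$$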


REMARK 8. The case where $g$ may depend also on $X$ is obvious. It is important to note that Lemma 5.3 may be viewed as a generalization of Lo ([32], Lemma 2).


We now use Lemma 5.3 to obtain a simpler description of the distribution of 
(T, X), which also yields easily the EPPF formulae and a corresponding distri- 
bution on (m, p). Note again that we do this without resorting to the types of 
combinatorial arguments used, for instance, in [1] and [32]. 




PROPOSITION 5.1. Let $(T, X)$ denote the random variables with the exchangeable distribution $\mathcal{M}$ described in Lemma 5.2. Then this distribution may be expressed in terms of a conditional distribution of $T, X\mid m, p$ and a distribution of $(m, p)$ as follows:

(i) There exists a conditional distribution of $T, X\mid m, p$, given by $\pi(dT, dX\mid m, p)$ proportional to

$$\Bigl[\prod_{j=1}^{n(p)} e^{-\int_0^{T_{(j:n)}} \psi_{m_j, r_{j-1}}(s)\,\Lambda_0(ds)}\,\kappa_{m_j, r_{j-1}}(\rho\mid T_{(j:n)})\Bigr] \prod_{l=1}^{n(p)} \Lambda_0(dT_{(l:n)}, dX_l^*),$$

where $T_{(1:n)} > T_{(2:n)} > \cdots > T_{(n(p):n)}$ denotes the order statistics of the unique values $T^*$. In the homogeneous case the result reduces to

$$\pi(dT, dX\mid m, p) = \Bigl[\prod_{j=1}^{n(p)} \phi(r_j)\Bigr] \prod_{j=1}^{n(p)} e^{-\psi_{m_j, r_{j-1}}\Lambda_0(T_{(j:n)})} \prod_{l=1}^{n(p)} \Lambda_0(dT_{(l:n)}, dX_l^*). \tag{10}$$

In both cases $\prod_{j=1}^{n(p)} P_0(dX_j^*\mid T_{(j:n)})$ is the conditional distribution of $X\mid T, m$.

(ii) The distribution of $(m, p)$ is described as follows. The EPPF derived by i.i.d. sampling from $F$ is expressible as

$$\pi(p) = \sum_{m \in S_{n(p)}} L(1; m).$$

The representations imply the existence of a joint distribution of $(m, p)$ given by $\pi(m, p) = L(1; m)$. Additionally, in the case where $\rho(du\mid s) = \rho(du)$, the formulae reduce to

$$\pi(p) = \sum_{m \in S_{n(p)}} \frac{\prod_{j=1}^{n(p)} \kappa_{m_j, r_{j-1}}(\rho)}{\prod_{j=1}^{n(p)} \phi(r_j)} \qquad\text{and}\qquad \pi(m, p) = \frac{\prod_{j=1}^{n(p)} \kappa_{m_j, r_{j-1}}(\rho)}{\prod_{j=1}^{n(p)} \phi(r_j)}. \tag{11}$$

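PROOF. Statement (i) follows from (ii) and Lemma 5.2. The proof of (ii) in the general case follows from Lemma 5.3 with $g := 1$. In the case of $\rho(du\mid s) = \rho(du)$, $\pi(p)$ is equivalent to

$$\sum_{m \in S_{n(p)}} \Bigl[\prod_{j=1}^{n(p)} \kappa_{m_j, r_{j-1}}(\rho)\Bigr] \int_0^\infty \int_{t_{n(p)}}^\infty \cdots \int_{t_2}^\infty \prod_{j=1}^{n(p)} e^{-\Lambda_0(t_j)\psi_{m_j, r_{j-1}}}\,\Lambda_0(dt_j).$$

The result is concluded by evaluating $\int_0^\infty \int_{t_{n(p)}}^\infty \cdots \int_{t_2}^\infty \prod_{j=1}^{n(p)} e^{-\Lambda_0(t_j)\psi_{m_j, r_{j-1}}}\,\Lambda_0(dt_j)$. This is done by noting that for any positive $C$, $\int_t^\infty e^{-C\Lambda_0(s)}\,\Lambda_0(ds) = C^{-1}e^{-C\Lambda_0(t)}$. In addition, $\psi_{m_1, 0} = \phi(r_1)$, and for each $j$, $\phi(r_{j-1}) + \psi_{m_j, r_{j-1}} = \phi(r_j)$. $\square$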
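The homogeneous formulae for $\pi(m, p)$ and $\pi(p)$ are easy to evaluate once $\kappa$ and $\phi$ are available in closed form. As a hedged illustration, the sketch below uses the homogeneous beta process, $c(s) = \theta$ in (4), for which $\kappa_{m,r}(\rho) = \theta B(m, r + \theta)$ and $\phi(r) = \sum_{i=0}^{r-1} \theta/(\theta + i)$; these closed forms and all function names are worked out here for the example and are not quoted from the paper. It checks that $\pi(p)$ sums to one over all partitions of a four-element set.

```python
from itertools import permutations
from math import exp, lgamma


def phi_beta(r, theta):
    """phi(r) for the homogeneous beta process, r a positive integer."""
    return sum(theta / (theta + i) for i in range(r))


def kappa_beta(m, r, theta):
    """kappa_{m,r}(rho) = theta * B(m, r + theta)."""
    return theta * exp(lgamma(m) + lgamma(r + theta) - lgamma(m + r + theta))


def pi_ordered(m, theta):
    """pi(m, p) = prod_j kappa_{m_j, r_{j-1}} / phi(r_j), with m listed largest time first."""
    prob, r_prev = 1.0, 0
    for m_j in m:
        r_j = r_prev + m_j
        prob *= kappa_beta(m_j, r_prev, theta) / phi_beta(r_j, theta)
        r_prev = r_j
    return prob


def eppf(cell_sizes, theta):
    """pi(p): sum of pi(m, p) over all n(p)! orderings of the cells."""
    return sum(pi_ordered(m, theta) for m in permutations(cell_sizes))


def set_partitions(items):
    """Yield every partition of a list, each as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in set_partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield [[first]] + smaller


theta = 0.8  # hypothetical value
total = sum(eppf([len(b) for b in part], theta) for part in set_partitions([1, 2, 3, 4]))
print(total)
```

Note that `permutations` deliberately ranges over all $n(p)!$ orderings of the cells, including orderings that give the same vector of sizes, to match the sum over $S_{n(p)}$.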




Equation (10) in Proposition 5.1 can be used to deduce an explicit Markov prop- 
erty in the homogeneous case that has the interpretation that the distribution of the 
next death time only depends on the previous death time. Moreover, it demon- 
strates that it is fairly simple to sample from (10). 


PROPOSITION 5.2. Given $(m, p)$, let $T_{(1:n)}, \ldots, T_{(n(p):n)}$ be distributed according to (10). Moreover, set $\Lambda_0(t) = t$. Then, conditional on $T_{(j+1:n)}, \ldots, T_{(n(p):n)}$, the distribution of $T_{(j:n)}$ depends only on $T_{(j+1:n)} = t_{j+1}$ and is given by the truncated exponential distribution with density

$$P(T_{(j:n)} \in dt_j \mid T_{(j+1:n)} = t_{j+1}) = \phi(r_j)\,e^{-\phi(r_j)(t_j - t_{j+1})}\,dt_j$$

for $t_j > t_{j+1}$. In particular, the smallest value, or equivalently the first of $n(p)$ death times, $T_{(n(p):n)}$, has a marginal distribution that is exponential with parameter $\phi(n)$, that is,

$$P(T_{(n(p):n)} \in dy) = \phi(n)\,e^{-\phi(n) y}\,dy.$$
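The Markov structure in Proposition 5.2 translates directly into a sampler: start from the smallest death time, which is exponential with rate $\phi(n)$, and add independent exponential gaps with rates $\phi(r_j)$ while moving up the ranks. The following is a minimal sketch under the proposition's assumption $\Lambda_0(t) = t$; the function name and the example $\phi$ (the closed form for the $\alpha = 0$ case of Section 6.2.1 with a hypothetical $\theta = 2$) are illustrative choices.

```python
import random


def sample_death_times(m, phi, rng):
    """Draw T_(1:n) > ... > T_(n(p):n) given death counts m (largest time first).

    `phi` maps an at-risk count r_j to the Levy exponent phi(r_j);
    Lambda_0(t) = t is assumed, as in Proposition 5.2.
    """
    r, running = [], 0
    for m_j in m:
        running += m_j
        r.append(running)
    times = [0.0] * len(m)
    current = 0.0
    # Work upward from the smallest death time (last index) to the largest.
    for j in reversed(range(len(m))):
        current += rng.expovariate(phi(r[j]))
        times[j] = current
    return times


theta = 2.0
phi = lambda r: (theta + 1.0) * r / (theta + r)
print(sample_death_times([1, 2, 3], phi, random.Random(1)))
```

For a general $\Lambda_0$ one would presumably transform these draws through the inverse of $\Lambda_0$, but that step is not shown here.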


5.1. Some connections to exponential functionals and means of NTR processes. We now relate some of our results to those of Epifani, Lijoi and Prünster [11] and Carmona, Petit and Yor [5] concerning moment formulae for means of NTR processes. Briefly, using the relationship in (2), Epifani, Lijoi and Prünster ([11], Proposition 5) established the following moment formula, expressed in our notation, that characterizes the distribution of $I$:

$$E[I^n\mid\nu] = n! \int_0^\infty \int_{t_n}^\infty \cdots \int_{t_2}^\infty \prod_{j=1}^{n} \exp\Bigl(-\int_0^{t_j}\int_0^\infty (1 - e^{-y})\,e^{-(j-1)y}\,\tau(dy\mid s)\,\Lambda_0(ds)\Bigr)\,dt_j. \tag{12}$$

The authors also provide conditions under which the moments exist, which amounts to the finiteness of the moment of order $n$ of $F_0$; that is, $\int_0^\infty t^n F_0(dt) < \infty$. In addition, when $\rho(du\mid s) = \rho(du)$ and $\Lambda_0(t) = t$, the expression in (12) reduces to the interesting formula of Carmona, Petit and Yor ([5], Proposition 3.3), viewed within the context of exponential functionals of a subordinator,

$$E[I^n\mid\nu] = \frac{n!}{\prod_{j=1}^{n} \phi(j)}. \tag{13}$$

Notice that the specification $\Lambda_0(t) = t$ is equivalent to specifying $F_0$ as an exponential(1) distribution. In addition, Carmona, Petit and Yor ([5], Proposition 3.1) establish the following result for any $\lambda > 1$ and more general Lévy processes:

$$E[I^{\lambda}] = \frac{\lambda}{\phi(\lambda)}\,E[I^{\lambda - 1}].$$




Lemma 5.3 offers a complementary result to theirs in that one can express $E[I^n\mid\nu]$ in terms of sums over partitions $p$. Apparently, for NTR processes, a result of this type is only widely known in the case of the Dirichlet process, which follows as a special case of Lo [32]. The result is as follows.

COROLLARY 5.1. Let $I$ be defined as in (2). Then setting $g(t) = \prod_{i=1}^{n} t_i = \prod_{j=1}^{n(p)} t_j^{m_j}$ in Lemma 5.3, one has $I^n = \mathcal{L}(g)$ and hence

$$E[I^n\mid\nu] = \sum_{p}\Bigl[\sum_{m \in S_{n(p)}} L(g; m)\Bigr].$$

In particular, in the case where $\rho(du\mid s) = \rho(du)$ and $\Lambda_0(t) = t$, Lemma 5.3 combined with the result of Carmona, Petit and Yor [5] yields the identity

$$\sum_{p}\sum_{m \in S_{n(p)}} \Bigl[\prod_{j=1}^{n(p)} \kappa_{m_j, r_{j-1}}(\rho)\Bigr] \int_0^\infty \int_{t_{n(p)}}^\infty \cdots \int_{t_2}^\infty \prod_{j=1}^{n(p)} t_j^{m_j} e^{-t_j \psi_{m_j, r_{j-1}}}\,dt_j = \frac{n!}{\prod_{j=1}^{n} \phi(j)}.$$

Another relationship to the formula for $E[I^n\mid\nu]$, (13), given in [5], is seen in the next corollary, derived from Proposition 5.1, which describes the formula for the case where all cells are of the same size.


COROLLARY 5.2. Suppose that $\rho(du\mid s) = \rho(du)$ and $n = k\,n(p)$. Then with respect to the EPPF given in (11), the probability of the event $p = \{E_1, \ldots, E_{n(p)}\}$, such that the size of each cell is $k$, is

$$\pi(p) = \frac{n(p)!\,\prod_{j=1}^{n(p)} \int_0^1 u^{k}(1-u)^{(j-1)k}\rho(du)}{\prod_{j=1}^{n(p)} \phi(jk)}.$$

As special cases, when $n(p) = n$, the probability of no ties in the sample corresponds to the probability of the event $p = \{\{1\}, \{2\}, \ldots, \{n\}\}$ given by

$$\pi(p) = \frac{n!\,\prod_{j=1}^{n} \int_0^1 u(1-u)^{j-1}\rho(du)}{\prod_{j=1}^{n} \phi(j)} = E[I^n\mid\nu] \prod_{j=1}^{n} \int_0^1 u(1-u)^{j-1}\rho(du),$$

for $E[I^n\mid\nu]$ given in (13). When $n(p) = 1$, $p = \{1, 2, \ldots, n\}$ corresponds to the event that all the values in the sample are the same, and the probability is given by

$$\pi(p) = \frac{\int_0^1 u^{n}\rho(du)}{\phi(n)} = \frac{\int_0^\infty (1 - e^{-y})^{n}\,\tau(dy)}{\int_0^\infty (1 - e^{-ny})\,\tau(dy)}.$$
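A quick sanity check of Corollary 5.2 is possible for the choice $\rho(du) = \theta(\theta+1)(1-u)^{\theta-1}du$ of Section 6.2.1, where the EPPF should agree with the Ewens formula of Section 3.1. The closed forms used below, $\kappa_{m,r}(\rho) = \theta(\theta+1)B(m+1, \theta+r)$ and $\phi(r) = (\theta+1)r/(\theta+r)$, are derived here for the example, and the function names are hypothetical.

```python
from math import exp, factorial, lgamma, log


def phi(r, theta):
    return (theta + 1.0) * r / (theta + r)


def kappa(m, r, theta):
    """theta (theta + 1) B(m + 1, theta + r)."""
    log_beta = lgamma(m + 1) + lgamma(theta + r) - lgamma(theta + r + m + 1)
    return theta * (theta + 1.0) * exp(log_beta)


def prob_equal_cells(k, n_cells, theta):
    """Corollary 5.2: n(p)! * prod_j kappa_{k,(j-1)k} / prod_j phi(jk)."""
    prob = float(factorial(n_cells))
    for j in range(1, n_cells + 1):
        prob *= kappa(k, (j - 1) * k, theta) / phi(j * k, theta)
    return prob


def ewens(cell_sizes, theta):
    """Ewens-type EPPF: theta^{n(p)} Gamma(theta) / Gamma(theta + n) * prod Gamma(e_j)."""
    n = sum(cell_sizes)
    log_p = len(cell_sizes) * log(theta) + lgamma(theta) - lgamma(theta + n)
    return exp(log_p + sum(lgamma(e) for e in cell_sizes))


# No ties (k = 1), two cells of size two, and a single cell of size four.
for k, cells in [(1, 5), (2, 2), (4, 1)]:
    print(prob_equal_cells(k, cells, 1.3), ewens([k] * cells, 1.3))
```

The two columns printed for each case should coincide up to rounding, covering both special cases singled out in the corollary.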




REMARK 9. The event of no ties, $n(p) = n$, corresponds to the common assumption in the literature for observed data. Analogous to Antoniak [1] for the Dirichlet process, it follows that when $n(p) = n$, using Corollary 5.2, the distribution of $T, X\mid p$ in the homogeneous case is

$$\frac{\prod_{j=1}^{n} \phi(j)}{n!} \prod_{i=1}^{n} e^{-\psi_{1, i-1}\Lambda_0(T_{(i:n)})} \prod_{j=1}^{n} \Lambda_0(dT_j, dX_j).$$


REMARK 10. Gnedin and Pitman [18], independent of this work and by dif- 
ferent arguments, obtain formulae for what are called regenerative compositions 
that contain our results in (11). Their formulae are derived from a discretization of 
subordinators. In fact, the authors show that all such regenerative compositions are 
determined uniquely by their construction via subordinators. The authors’ result is 
more general, in the homogeneous case, because they include the result for subor- 
dinators with drift components. It is, however, a simple matter to adjust our results 
to allow for a drift (see [24], Remark 28). They do not cover the inhomogeneous 
cases we consider. We discovered these connections through a mutual exchange 
of manuscripts in progress. The authors' description via a decrement function and composition structure contains additional binomial coefficients. Explicitly in terms of our notation, their composition structure is expressed as

$$\frac{n!}{\prod_{j=1}^{n(p)} e_j!}\,\pi(m, p).$$


The authors identify some particularly interesting composition structures and we 
will show how this translates into an interesting class of spatial NTR models. See 
also [10, 17, 35] for relevant references. See also [19, 20] for important results 
related to the rates of various n(p). 


5.2. Sampling $\mathcal{M}$: modified Chinese restaurant processes. Propositions 5.1, 5.2 and 4.1 dictate how one might sample $(T, X)$ from $\mathcal{M}$. This is especially true in the homogeneous case. One proceeds essentially by first obtaining a draw of $(m, p)$ from $\pi(m, p)$, then using Proposition 5.1 or 5.2 to draw the ordered unique values $(T_{(j:n)})$ from the relevant truncated exponential distributions. The $X_j^*$ are then drawn from $P_0(dX_j^*\mid T_{(j:n)})$ for $j = 1, \ldots, n(p)$. Additionally one can then (approximately) draw $F\mid T, X$ by using the representation $F_n^*$ from Proposition 4.1, which suggests drawing the $(J_{j,n})$ and then applying methods in the literature to approximate quantities such as $\Lambda_n$ (see, e.g., [4]). These are precisely the type of steps that would lead to efficient approximations in more complex mixture models, that is to say, models where $(T, X)$ are missing values obtained from $\mathcal{M}$ and are not directly observed. Also, by sampling from $\mathcal{M}$ one can approximate quantities such as those that appear in Lemma 5.3. In this section it is shown how one might generate $(m, p)$ from $\pi(m, p)$ in the case where $\rho(du\mid s) = \rho(du)$ via a sequential seating scheme with probabilities derived from the prediction rule given $(m, p)$. This idea also holds in the nonhomogeneous case. The scheme bears similarities to generalized Chinese restaurant processes that can be used to generate general EPPFs, $\pi(p) = p(e_1, \ldots, e_{n(p)})$. Using the description in [37], page 60, the generalized Chinese restaurant scheme assumes that an initially empty Chinese restaurant has an unlimited number of tables labeled $1, 2, \ldots$. Customers numbered $1, 2, \ldots$ arrive one by one and are seated sequentially according to probabilities derived from ratios of the EPPF. Basically customers are seated with probabilities that depend on the size or number of customers already seated at the existing tables.


5.2.1. Ordered generalized Chinese restaurant processes. In general, to draw from $p(m_1, \ldots, m_{n(p)}) := \pi(m, p)$, we introduce a new scheme, which is a modified Chinese restaurant process that also records the rank of the entering customers relative to the already seated customers. The first customer is seated and assigned an initial rank of 1. Now, given a configuration based on $n$ customers seated at $n(p)$ existing tables labeled with ranks $j = 1, \ldots, n(p)$, the next customer $n + 1$ is seated at an occupied table $j$, denoting that customer $n + 1$ is equivalent to the $j$th largest seated customers, with probability

$$p_{j:n} = \frac{p(\ldots, m_j + 1, \ldots)}{p(m_1, \ldots, m_{n(p)})} = \frac{\kappa_{m_j + 1, r_{j-1}}(\rho)}{\kappa_{m_j, r_{j-1}}(\rho)} \prod_{l=j+1}^{n(p)} \frac{\kappa_{m_l, r_{l-1} + 1}(\rho)}{\kappa_{m_l, r_{l-1}}(\rho)} \prod_{l=j}^{n(p)} \frac{\phi(r_l)}{\phi(r_l + 1)}. \tag{14}$$

Customer $n + 1$ is seated at a new table with probability $1 - \sum_{j=1}^{n(p)} p_{j:n}$. However, if customer $n + 1$ is new, it is also necessary to know the customer's rank and as such to rerank by one position all customers smaller than the new customer. Hence the probability that customer $n + 1$ is new and is the $j$th largest among $n(p) + 1$ possible ranks is

$$q_{j:n} = \frac{p(\ldots, m_{j-1}, 1, m_j, \ldots)}{p(m_1, \ldots, m_{n(p)})} = \frac{\kappa_{1, r_{j-1}}(\rho)}{\phi(r_{j-1} + 1)} \prod_{l=j}^{n(p)} \frac{\kappa_{m_l, r_{l-1} + 1}(\rho)}{\kappa_{m_l, r_{l-1}}(\rho)}\,\frac{\phi(r_l)}{\phi(r_l + 1)},$$

with $q_{n(p)+1:n} = \kappa_{1, n}(\rho)/\phi(n + 1)$. Note that in the calculation of $\kappa_{1, r_{j-1}}(\rho)$, $r_{j-1} + 1$ is to be used rather than $r_j = r_{j-1} + m_j$.

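As an example, consider the choice of a homogeneous beta process that corresponds to $c(s) = \theta$ in (4). Then it is easily seen that $\phi(r_j) = \sum_{i=1}^{r_j} \theta/(\theta + i - 1)$ and it follows that, in this case,

$$p_{j:n} = \frac{m_j}{n + \theta} \prod_{l=j}^{n(p)} \frac{\phi(r_l)}{\phi(r_l + 1)}$$

and

$$q_{j:n} = \frac{1}{n + \theta}\,\frac{1}{\sum_{i=1}^{r_{j-1}+1} 1/(\theta + i - 1)} \prod_{l=j}^{n(p)} \frac{\phi(r_l)}{\phi(r_l + 1)}.$$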
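The ordered seating scheme of Section 5.2.1 can be sketched as follows. The code takes generic callables `kappa(m, r)` and `phi(r)` for a homogeneous $\rho$, computes the probabilities $p_{j:n}$ from (14) and the new-table probabilities $q_{j:n}$, and seats customers one at a time. The function names, the factory `make_beta_process` and the value $\theta = 1.7$ are illustrative assumptions; the beta-process closed form $\kappa_{m,r}(\rho) = \theta B(m, r + \theta)$ is derived here from (4) with $c(s) = \theta$.

```python
import random
from math import exp, lgamma


def make_beta_process(theta):
    """kappa and phi for the homogeneous beta process, c(s) = theta in (4)."""
    def kappa(m, r):
        return theta * exp(lgamma(m) + lgamma(r + theta) - lgamma(m + r + theta))

    def phi(r):
        return sum(theta / (theta + i) for i in range(r))

    return kappa, phi


def seating_probabilities(m, kappa, phi):
    """Return (p, q) for table sizes m, listed from largest rank to smallest.

    p[j] is the probability of joining the table of rank j + 1;
    q[j] is the probability of opening a new table at rank j + 1.
    """
    k = len(m)
    r = [0]
    for m_j in m:
        r.append(r[-1] + m_j)              # r[j] = m_1 + ... + m_j, r[0] = 0
    a = [kappa(m[l], r[l] + 1) / kappa(m[l], r[l]) for l in range(k)]
    b = [phi(r[l + 1]) / phi(r[l + 1] + 1) for l in range(k)]
    tail = [1.0] * (k + 1)                 # tail[j] = prod over l >= j of a[l] * b[l]
    for l in range(k - 1, -1, -1):
        tail[l] = tail[l + 1] * a[l] * b[l]
    p = [kappa(m[j] + 1, r[j]) / kappa(m[j], r[j]) * b[j] * tail[j + 1] for j in range(k)]
    q = [kappa(1, r[j]) / phi(r[j] + 1) * tail[j] for j in range(k + 1)]
    return p, q


def seat_customers(n, kappa, phi, rng):
    """Seat n customers sequentially and return the ordered table sizes (m_1, ..., m_{n(p)})."""
    m = [1]
    for _ in range(n - 1):
        p, q = seating_probabilities(m, kappa, phi)
        u = rng.random() * (sum(p) + sum(q))
        for j, w in enumerate(p):
            if u < w:
                m[j] += 1                  # join the table of rank j + 1
                break
            u -= w
        else:
            for j, w in enumerate(q):
                if u < w or j == len(q) - 1:
                    m.insert(j, 1)         # new table; smaller tables shift down one rank
                    break
                u -= w
    return m


kappa, phi = make_beta_process(1.7)
p, q = seating_probabilities([2, 1, 3], kappa, phi)
print(sum(p) + sum(q))
print(seat_customers(25, kappa, phi, random.Random(0)))
```

Tables of equal size generally receive different probabilities because their ranks differ, which is what distinguishes this scheme from the usual unordered Chinese restaurant process.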


5.3. Species sampling models generated by spatial NTR processes. The availability of the EPPF, coupled with Pitman's [34] theory of species sampling random probability models, implies that there exists a new explicit class of random probability measures of the form

$$P_F(\cdot) = \sum_{i=1}^{\infty} Q_i\,\delta_{Z_i}(\cdot), \tag{15}$$

where the $Z_i$ are i.i.d. random elements in $\mathcal{X}$ with some nonatomic law $P_0$ and where, independent of $(Z_i)$, $(Q_i)$ denotes a collection of random probabilities that sum to 1 and whose law is completely determined by the EPPF $\pi(p)$ given in Proposition 5.1. We will call $P_F$ an NTR species sampling model. We do point out that although there are technically a large number of possible species sampling models, to date there are only two well-known classes: the species sampling models based on the Poisson-Kingman models described in [38] (see also [24]) and the stick-breaking models described in [22]. See also [40].

The NTR species sampling model, which is defined for the first time here, represents a third case where, due to the present analysis, much is known. All three classes contain the Dirichlet process. In fact, rather remarkably, all three classes contain the two-parameter $(\alpha, \theta)$ Poisson-Dirichlet family of random probability measures with parameters $0 \le \alpha < 1$ and $\theta \ge 0$. We will describe this in a forthcoming section. The next proposition describes how one can always formally obtain an NTR species sampling model generated by an $F$ with an independent prior specification, $F_0(ds, dx) = F_0(ds)P_0(dx)$. Moreover, we give a description of its posterior distribution.


PROPOSITION 5.3. Let $\nu(du, ds, dx) = \rho(du\mid s)\Lambda_0(ds, dx)$ denote the mean intensity of a Poisson random measure $N$ on $\mathcal{W}$, where the intensity is chosen such that $\Lambda_0(ds, dx) = \Lambda_0(ds)P_0(dx)$. Then the corresponding spatial NTR process, $F$, generates an NTR species sampling model, $P_F$, given in (15), by the representation $P_F(dx) := F(\infty, dx) = \int_0^\infty S(s-)\Lambda(ds, dx)$ or, equivalently, the marginal distribution of $X = (X^*, p)$ is given by

$$E\Bigl[\prod_{i=1}^{n} P_F(dX_i)\Bigr] = \pi(p) \prod_{j=1}^{n(p)} P_0(dX_j^*).$$

Additionally, the posterior distribution of $P_F$ given $(T, X)$, or just $X$, is characterized by Propositions 4.1 and 5.1. Specifically, it is equivalent to the appropriate conditional laws of the random measure $F_n^*(\infty, dx)$.


PROOF. Under the specifications $F_0(ds, dx) = F_0(ds)P_0(dx)$, $\mathcal{M}(dT, dX)$ is such that given $p$, the vectors $T^*$ and $X^*$ are independent, where $X^*$ has joint law $\prod_{j=1}^{n(p)} P_0(dX_j^*)$. The result is concluded by integrating out $T^*$. $\square$


It is interesting to note that while the Dirichlet process is an example of $P_F$, it also arises without the independence specification. In most cases $\pi(p)$ will not be easy to work with directly; as such one can work with $\pi(m, p)$. As an example, we present a description for the prediction rule of $P_F$ given $(X, m)$. It will be clear that one can employ the ordered generalized Chinese restaurant algorithm in Section 5.2.1 to draw easily from a joint distribution of $(X, m)$.


PROPOSITION 5.4. Let $P_F$ denote an NTR species sampling model defined by the choice $\rho(du\mid s) = \rho(du)$. Suppose that $X = \{X_1, \ldots, X_n\}$ given $P_F$ are i.i.d. $P_F$. Then one can define a prediction rule for $X_{n+1}$ given $X, m$ as

$$P(X_{n+1} \in dx \mid X, m) = \Bigl(1 - \sum_{j=1}^{n(p)} p_{j:n}\Bigr)P_0(dx) + \sum_{j=1}^{n(p)} p_{j:n}\,\delta_{X_j^*}(dx),$$

where the $(p_{j:n})$ are given in (14). Note also that $\pi(m, p)\prod_{j=1}^{n(p)} P_0(dX_j^*)$ is the distribution of $(X, m)$, which means that the distribution of $X\mid m$ is such that the unique values $(X_j^*)$ given $m$ are i.i.d. $P_0$. The prediction rule given $X$ is obtained by $P(X_{n+1} \in dx \mid X) = \sum_{m \in S_{n(p)}} P(X_{n+1} \in dx \mid X, m)\,\pi(m\mid p)$.
6. Examples. 


6.1. Generalized gamma models. An interesting class of measures is the family of generalized gamma random measures discussed in [4]. Using the description of Brix [4], these are $Z$ processes with Lévy measure

$$\tau_{\alpha, b}(dy)\,\Lambda_0(ds, dx) = \frac{1}{\phi_{\alpha, b}(1)\,\Gamma(1 - \alpha)}\,y^{-\alpha - 1}\exp(-by)\,dy\,\Lambda_0(ds, dx),$$

where $\phi_{\alpha, b}(1) = \frac{1}{\alpha}[(b + 1)^{\alpha} - b^{\alpha}]$. The values for $\alpha$ and $b$ are restricted to satisfy $0 < \alpha < 1$ and $0 \le b < \infty$ or $-\infty < \alpha \le 0$ and $0 < b < \infty$. Different choices for $\alpha$ and $b$ in $\tau_{\alpha, b}$ yield various subordinators. These include the stable subordinator when $b = 0$, the gamma process subordinator when $\alpha = 0$ and the inverse-Gaussian subordinator when $\alpha = 1/2$ and $b > 0$. When $\alpha < 0$, this results in a




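class of gamma compound Poisson processes. Generalized gamma NTR processes with $b > 0$ are discussed in [11]. Here, from our results,

$$\phi(r_j) = \frac{(r_j + b)^{\alpha} - b^{\alpha}}{(1 + b)^{\alpha} - b^{\alpha}}$$

and

$$\kappa_{m_j, r_{j-1}}(\rho) = \frac{\sum_{l=0}^{m_j} \binom{m_j}{l}(-1)^{l+1}(b + r_{j-1} + l)^{\alpha}}{(1 + b)^{\alpha} - b^{\alpha}}.$$

Hence

$$\pi(m, p) = \frac{\prod_{j=1}^{n(p)} \sum_{l=0}^{m_j} \binom{m_j}{l}(-1)^{l+1}(b + r_{j-1} + l)^{\alpha}}{\prod_{j=1}^{n(p)} [(b + r_j)^{\alpha} - b^{\alpha}]}.$$

The process $F(ds, dx)$ is such that marginally $F(ds, \mathcal{X})$ is a generalized gamma NTR process and $P_F(dx) = F(\infty, dx)$ is a species sampling model. Additionally one can use Proposition 5.2 to generate the $T_{(j:n)}$. In particular, when $b = 0$ and $\Lambda_0(t) = t$, the density corresponding to the stable process with index $0 < \alpha < 1$ is

$$P(T_{(j:n)} \in dt_j \mid T_{(j+1:n)} = t_{j+1}) = r_j^{\alpha}\,e^{-r_j^{\alpha}(t_j - t_{j+1})}\,dt_j \quad\text{for } t_j > t_{j+1}.$$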
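The generalized gamma expressions can be checked numerically without enumerating set partitions: each ordered vector of sizes $(m_1, \ldots, m_{n(p)})$ corresponds to $n!/\prod_j m_j!$ ordered partitions, which is the weight appearing in the composition structure of Remark 10. The sketch below, with hypothetical function names and parameter values, sums these weighted probabilities over all compositions of $n$ and confirms that they total one for $0 < \alpha < 1$.

```python
from math import comb, factorial, prod


def phi_gg(r, alpha, b):
    """phi(r) = [(r + b)^alpha - b^alpha] / [(1 + b)^alpha - b^alpha]."""
    return ((r + b) ** alpha - b ** alpha) / ((1 + b) ** alpha - b ** alpha)


def kappa_gg(m, r, alpha, b):
    """Alternating-sum form of kappa_{m,r}(rho) for the generalized gamma model."""
    norm = (1 + b) ** alpha - b ** alpha
    total = sum(comb(m, l) * (-1) ** (l + 1) * (b + r + l) ** alpha for l in range(m + 1))
    return total / norm


def pi_ordered(m, alpha, b):
    """pi(m, p) = prod_j kappa_{m_j, r_{j-1}} / phi(r_j)."""
    prob, r_prev = 1.0, 0
    for m_j in m:
        r_j = r_prev + m_j
        prob *= kappa_gg(m_j, r_prev, alpha, b) / phi_gg(r_j, alpha, b)
        r_prev = r_j
    return prob


def compositions(n):
    """Yield all ordered tuples of positive integers summing to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


def total_mass(n, alpha, b):
    """Sum over compositions of n!/prod(m_j!) * pi(m, p)."""
    return sum(
        factorial(n) / prod(factorial(m_j) for m_j in m) * pi_ordered(m, alpha, b)
        for m in compositions(n)
    )


print(total_mass(4, 0.5, 1.0))   # inverse-Gaussian type case, b > 0
print(total_mass(4, 0.5, 0.0))   # stable case, b = 0
```

The alternating sums can lose precision for large cell sizes, so this direct evaluation is best treated as a check for small $n$.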

6.2. The spatial NTR two-parameter Poisson-Dirichlet model. We now describe perhaps the most remarkable class of spatial NTR processes. Gnedin and Pitman ([18], Section 10) were able to deduce that one can generate the EPPF of the two-parameter $(\alpha, \theta)$ Poisson-Dirichlet distribution with parameters $0 \le \alpha < 1$ and $\theta \ge 0$ by specifying a homogeneous $\rho$ such that

$$\int_u^1 \rho(dv) = \frac{\Gamma(\theta + 2 - \alpha)}{\Gamma(1 - \alpha)\Gamma(1 + \theta)}\,u^{-\alpha}(1 - u)^{\theta} \tag{16}$$

and, hence,

$$\phi(r_j) = \frac{r_j\,\Gamma(\theta + r_j)\,\Gamma(\theta + 2 - \alpha)}{\Gamma(\theta + 1)\,\Gamma(\theta - \alpha + r_j + 1)}.$$

Due to Proposition 5.2, this is enough to generate the distribution of the $(T_{(j:n)})$. Note for this model one can directly sample from the well-known EPPF.

6.2.1. The ordered ESF and the Dirichlet process. An interesting case is when 
a = 0; that is, o(du) = 0(0 + 1)(1 — u)?~! du. This choice generates the ordered 
Ewens sampling formula as described in [10]. Moreover, the spatial NTR process 
F (ds, dx) is such that F(ds, X) is an NTR process but not a Dirichlet process, 
and it follows from Proposition 5.3 that Pr (dx) = F (co, dx) is a Dirichlet process 


SPATIAL NTR PROCESSES 435 


with shape $\theta P_0$. Hence, this shows that a Dirichlet process may be generated via
a homogeneous NTR process derived from a compound Poisson process. Note,
of course, that when $x \in \mathbb{R}$, this process is marginally an NTR process in both
coordinates. Setting $\Lambda_0(t) = t$, the corresponding distribution of the $(T_{(j:n)})$ is given
by


(0 + Dri (9.1) - 
P T, fe € dt T ; à =f = —— e r;/(0--r,))lt; [4a] 
(TG sn) € at, |TG+i:n) — £541) @+r)) 
for t, > £j. 
Note also that the distribution of the jumps $(J_{j,n})$ is

\[
P(J_{j,n} \in du \mid T_{(j:n)}) = \frac{\Gamma(\theta + r_j + 1)}{\Gamma(m_j + 1)\,\Gamma(\theta + r_{j-1})}\, u^{m_j}(1 - u)^{\theta + r_{j-1} - 1}\,du;
\]

that is, they are beta distributed with parameters $(m_j + 1, \theta + r_{j-1})$. Note that
these are not the jumps of a posterior Dirichlet process. However, since $F(\infty, dx)$
is a Dirichlet process, its posterior distribution given $\mathbf{X}$ is a Dirichlet process with
shape $\theta P_0 + \sum_{i=1}^{n} \delta_{X_i}$.


6.2.2. Representations for the general two-parameter $(\alpha, \theta)$ case. We can use
the result above to provide some new results related to the two-parameter $(\alpha, \theta)$
Poisson-Dirichlet family and spatial NTR processes. For clarity, we first recall the
definition of the two-parameter $(\alpha, \theta)$ Poisson-Dirichlet class of random probability
measures. The two-parameter $(\alpha, \theta)$ Poisson-Dirichlet random probability
measure with parameters $0 \le \alpha < 1$ and $\theta \ge 0$ has the known representation

\[
P_{\alpha,\theta}(dx) = \frac{\mu_{\alpha,\theta}(dx)}{T_{\alpha,\theta}},
\]

where $\mu_{\alpha,\theta}$ is a finite random measure on $\mathcal{X}$ with law $\mathbb{P}(d\mu_{\alpha,\theta})$ and where
$T_{\alpha,\theta} = \mu_{\alpha,\theta}(\mathcal{X})$ is a random variable. The law of the random measure $\mu_{\alpha,\theta}$ can
be described as follows. When $\alpha = 0$, $\mu_{0,\theta}$ is a gamma process with shape $\theta P_0$;
hence, $P_{0,\theta}$ is a Dirichlet process with shape $\theta P_0$. When $\theta = 0$, $\mu_{\alpha,0}$ is a stable
random measure of index $0 < \alpha < 1$. Note that both $\mu_{\alpha,0}$ and $\mu_{0,\theta}$ are completely
random measures and can be represented in terms of a Poisson random measure.
However, this is not true for the case where both $\alpha$ and $\theta$ are positive. Here for
$0 < \alpha < 1$ and $\theta > 0$ one has the absolute continuity relationship

\[
\mathbb{P}(d\mu_{\alpha,\theta}) = \frac{T_{\alpha,0}^{-\theta}\,\mathbb{P}(d\mu_{\alpha,0})}{E[T_{\alpha,0}^{-\theta}]},
\]

where $T_{\alpha,0}$ is a stable law random variable. This class of models also has a
representation in terms of stick-breaking processes. See, for instance, [22, 34, 39] for
further details. We now arrive at the following interesting observations.




PROPOSITION 6.1. Let $F(ds, dx)$ denote a spatial NTR process specified by
the choice of $\rho$ in (16) and let $F_0(ds, dx) = P_0(dx)F_0(ds)$. Then $P_F$ is a two-parameter
$(\alpha, \theta)$ Poisson-Dirichlet process. This yields the representations

\[
P_F(dx) = \int_0^{\infty} F(ds, dx)
        = \sum_{k=1}^{\infty} V_k \prod_{i=1}^{k-1} (1 - V_i)\,\delta_{Z_k}(dx)
        = \mu_{\alpha,\theta}(dx)/T_{\alpha,\theta} = P_{\alpha,\theta}(dx),
\]

where $(V_k)$ are independent beta$(1 - \alpha, \theta + k\alpha)$ random variables independent
of the $(Z_k)$, which are i.i.d. $P_0$; that is, a two-parameter $(\alpha, \theta)$ Poisson-Dirichlet
process can be represented as the marginal probability measure of a spatial NTR
process as described above.


PROOF. The general result follows from an application of Proposition 5.3
combined with the calculations of the EPPF using $\rho$ in (16) by Gnedin and Pitman
([18], Section 10). The case of the Dirichlet process that corresponds to the choice
of $\rho(du) = \theta(\theta + 1)(1 - u)^{\theta - 1}\,du$ could be deduced as well from [10] in combination
with Proposition 5.3. See also [35] for the $(\alpha, \alpha)$ model. □
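
The stick-breaking representation in Proposition 6.1 can be sketched numerically as follows. This is a minimal illustration, not the construction via the spatial NTR process itself: it truncates the infinite sum after a fixed number of atoms, and the names `pd_stick_breaking` and `base_sampler` (a user-supplied function drawing from $P_0$) are hypothetical.

```python
import random


def pd_stick_breaking(alpha, theta, n_atoms, base_sampler, rng=None):
    """Truncated stick-breaking draw from a two-parameter (alpha, theta)
    Poisson-Dirichlet random probability measure.

    V_k ~ beta(1 - alpha, theta + k*alpha) independently, and the k-th
    weight is V_k * prod_{i<k} (1 - V_i). Assumes 0 <= alpha < 1, theta > 0.
    Returns (weights, atoms, leftover_mass)."""
    rng = rng or random.Random()
    weights, atoms = [], []
    remaining = 1.0  # mass not yet assigned: prod_{i<k} (1 - V_i)
    for k in range(1, n_atoms + 1):
        v = rng.betavariate(1.0 - alpha, theta + k * alpha)
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))  # Z_k drawn i.i.d. from P_0
        remaining *= 1.0 - v
    return weights, atoms, remaining
```

As a usage example, `pd_stick_breaking(0.5, 1.0, 200, lambda r: r.random())` takes $P_0$ to be uniform on $(0, 1)$; the returned weights plus the leftover mass sum to one up to rounding. Setting `alpha = 0` gives the familiar beta$(1, \theta)$ sticks of the Dirichlet process.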


APPENDIX 


Proofs of Proposition 4.1 and Lemmas 5.1 and 5.2. We now show that
the proofs of Proposition 4.1 and Lemmas 5.1 and 5.2 follow as a simple consequence
of the Poisson partition calculus methods as laid out in [24, 26]. First set
$W_i = (J_i, T_i, X_i)$ for $i = 1, \dots, n$, elements of $\mathcal{W}$. The collection $\mathbf{J} = \{J_1, \dots, J_n\}$
with values in $[0, 1]$ will play the role of the latent jumps. Its unique values
are the $(J_{j,n})$. Set $\mathbf{W} = (\mathbf{J}, \mathbf{T}, \mathbf{X})$ and let $W_j^* = (J_{j,n}, T_{(j:n)}, X_j^*)$ denote the
$j = 1, \dots, n(p)$ unique triples. Using Proposition 2.3 of [26] yields the following
statement. Suppose that $(\mathbf{W}, N)$ are measurable elements in the space $\mathcal{W}^n \times \mathcal{M}$,
where $N$ is a Poisson random measure with sigma-finite nonatomic mean measure $\nu$.
Then for each nonnegative measurable f such that $( f) < oo, the following dis- 
integration holds: 


n vawo| e "OP(aN|v) 


i=] 
co n(p) 
=e OPAN vs, W) [|e Paw”, 
j=l 


where $P(dN \mid \nu_f, \mathbf{W})$ denotes the law of the random measure $N + \sum_{j=1}^{n(p)} \delta_{W_j^*}$,
where $N$ is a Poisson random measure with mean intensity $E[N(du, ds, dx) \mid \nu_f] =
\nu_f(du, ds, dx) := e^{-f(u,s,x)}\,\nu(du, ds, dx)$.




To apply the results above we first express (9) in terms of an exponential functional
of a Poisson random measure as follows. For each $j$, set $f_{T_{(j:n)}}(u, s, x) =
I(s < T_{(j:n)})[-\log(1 - u)]$. Now it follows that one can define

\[
f_n(u, s, x) = \sum_{j=1}^{n(p)} m_j f_{T_{(j:n)}}(u, s, x) = -Y_n(s)\log(1 - u)
\]

and hence one has

\[
\prod_{j=1}^{n(p)} S(T_{(j:n)}-)^{m_j} = e^{-\sum_{j=1}^{n(p)} N(m_j f_{T_{(j:n)}})} = e^{-N(f_n)}.
\]

Note also that $e^{-f_n(u,s,x)} = (1 - u)^{Y_n(s)}$ and $e^{-f_n(J_{j,n}, T_{(j:n)}, X_j^*)} = (1 - J_{j,n})^{r_{j-1}}$ for
$j = 1, \dots, n(p)$.

The next step is to write $\Lambda(dT_i, dX_i) = \int_0^1 J_i\, N(dJ_i, dT_i, dX_i)$. Now removing
those integrals in (6) yields an augmentation of the distribution of $(\mathbf{T}, \mathbf{X}, N)$
in terms of a distribution of $(\mathbf{J}, \mathbf{T}, \mathbf{X}, N)$. It follows that the distribution of
$(\mathbf{J}, \mathbf{T}, \mathbf{X}, N)$ can be expressed similar to the left-hand side of (17) with $f_n$ in place
of $f$ as


n(p) n 
f IE e | [[N(44.47, ax, |e MRAM) 


1—1 


Note that | [7 . RAT AE J”) Hence now applying the right-hand side of (17) 
one has that the a distribution of (J, T, X, N) is given by 


P(dN Je v; , Wye bn) 
oo np) n(p) 
= | Il Jin (1— Jj P yalTo:n)]| I] Ao(T;", Xp, 

j=l [==] 
where EE[e ^" U9)|y] = e~#U) and now IP(dN |vy, , W) corresponds to the law, for 
fixed W, of a random measure Ny + s ÖJ, m Ty n)» xt where N, is a Poisson 
random measure with mean described in (8) and P(dN|vy,, W) is the posterior 
distribution of N|J, T, X. The joint distribution of (J, T, X) is obtained by inte- 
grating out N in (18). Now using the fact that one can decompose (J, T, X) as 
((J5,5), T*, X*, p), it follows that an expression for the marginal distribution of 
(T, X), or equivalently (T*, X*, p), is obtained by integrating out N and the (J, ,). 
For clarity, this takes the form 


n(p) 
M(dT, dX) = e $02? T Km rji (PITY: »)| | [ ^od, xp. 
yel [=] 




The description of the posterior distribution of $N \mid \mathbf{T}, \mathbf{X}$ is given in terms of the
distribution of $N \mid \mathbf{J}, \mathbf{T}, \mathbf{X}$ mixed over the distribution of the $(J_{j,n})$ given $(\mathbf{T}, \mathbf{X})$.
The distribution of $(J_{j,n})$ follows by an appeal to the classical Bayes rule; that
is, one integrates out $N$ in (18) and then divides the remaining quantity by $M$.
This yields the results in Proposition 4.1. Now it follows that the description of
$M$ given in Lemma 5.2 is completed by verifying Lemma 5.1. This is obtained by
using repeatedly the exponential change of measure described in Proposition 2.1
of [26]. This is the same as working with (17) after removing all the terms that
involve $\mathbf{W}$; that is, the disintegration $e^{-N(f)}P(dN \mid \nu) = P(dN \mid \nu_f)\,E[e^{-N(f)} \mid \nu]$.
We apply this repeatedly to the measure $\prod_{j=1}^{n(p)} e^{-N(m_j f_{T_{(j:n)}})}\, P(dN \mid \nu)$. To see
this, first set $g_j := m_j f_{T_{(j:n)}}$ for $j = 1, \dots, n(p)$ and let each $g_j$ now play the role
of an $f$. We demonstrate the first two steps. Notice that the first term is obtained
as


Ti n 
e "6DP(dN|v) = P(dN|vg)e" ^» Sn Vm rg (5) Ao(ds) 
= P(d N |vg )E[e ^ 8 |v]. 
The next term is obtained as 
= _ TQ) n) 
F NGDP(dN|vg,) = P(4N |vg,.1.g;)e fo^ Ymar G)Ao(ds) 


The last expression follows from the fact that for $s < T_{(2:n)}$, $e^{-g_2(u,s,x)} =
(1 - u)^{m_2}$ and $e^{-g_1(u,s,x)} = (1 - u)^{m_1}$ with $r_1 = m_1$. The next term would then exploit
this type of relationship for $g_1$, $g_2$ and $g_3$ on $s < T_{(3:n)}$, where $m_1 + m_2 = r_2$.
It is clear that continuing in this way leads to the conclusion of Lemma 5.1.


REMARK 11. More details, including an analysis of semiparametric models
subject to censoring mechanisms, are given in an older version of this
manuscript [25].


Acknowledgments. I would like to thank Kjell Doksum for early comments 
that helped in the exposition of this work. Thanks to Jim Pitman for clarifying 
some nice connections to his and Alexander Gnedin’s work. 


REFERENCES 


[1] ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian non- 
parametric problems. Ann. Statist. 2 1152-1174. MR0365969 

[2] BERTOIN, J. and YOR, M. (2001). On subordinators, self-similar Markov processes and
some factorizations of the exponential variable. Electron. Comm. Probab. 6 95-106.
MR1871698

[3] BLACKWELL, D. and MACQUEEN, J. B. (1973). Ferguson distributions via Pólya urn schemes.
Ann. Statist. 1 353-355. MR0362614

[4] BRIX, A. (1999). Generalized gamma measures and shot-noise Cox processes. Adv. in Appl.
Probab. 31 929-953. MR1747450


[5] CARMONA, P., PETIT, F. and YOR, M. (1997). On the distribution and asymptotic results for
exponential functionals of Lévy processes. In Exponential Functionals and Principal Values
Related to Brownian Motion (M. Yor, ed.) 73-130. Biblioteca de la Revista Matemática
Iberoamericana, Madrid. MR1648657

[6] DALEY, D. J. and VERE-JONES, D. (1988). An Introduction to the Theory of Point Processes.
Springer, New York. MR0950166

[7] DEY, J. (1999). Some properties and characterizations of neutral-to-the-right priors and beta
processes. Ph.D. dissertation, Michigan State Univ.

[8] DEY, J., ERICKSON, R. V. and RAMAMOORTHI, R. V. (2003). Some aspects of neutral to right
priors. Internat. Statist. Rev. 71 383-401.

[9] DOKSUM, K. A. (1974). Tailfree and neutral random probabilities and their posterior
distributions. Ann. Probab. 2 183-201. MR0373081

[10] DONNELLY, P. and JOYCE, P. (1991). Consistent ordered sampling distributions:
Characterization and convergence. Adv. in Appl. Probab. 23 229-258. MR1104078

[11] EPIFANI, I., LIJOI, A. and PRÜNSTER, I. (2003). Exponential functionals and means of
neutral-to-the-right priors. Biometrika 90 791-808. MR2024758

[12] EWENS, W. J. (1972). The sampling theory of selectively neutral alleles. Theoret. Population
Biology 3 87-112. MR0325177

[13] FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist.
1 209-230. MR0350949

[14] FERGUSON, T. S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2
615-629. MR0438568

[15] FERGUSON, T. S. and PHADIA, E. (1979). Bayesian nonparametric estimation based on
censored data. Ann. Statist. 7 163-186. MR0515691

[16] FREEDMAN, D. A. (1963). On the asymptotic behavior of Bayes estimates in the discrete case.
Ann. Math. Statist. 34 1386-1403. MR0158483

[17] GNEDIN, A. V. (1997). The representation of composition structures. Ann. Probab. 25
1437-1450. MR1457625

[18] GNEDIN, A. V. and PITMAN, J. (2005). Regenerative composition structures. Ann. Probab. 33
445-479. MR2122798

[19] GNEDIN, A. V., PITMAN, J. and YOR, M. (2006). Asymptotic laws for regenerative
compositions: Gamma subordinators and the like. Probab. Theory Related Fields. To appear.

[20] GNEDIN, A. V., PITMAN, J. and YOR, M. (2006). Asymptotic laws for compositions derived
from transformed subordinators. Ann. Probab. 34. To appear.

[21] HJORT, N. L. (1990). Nonparametric Bayes estimators based on beta processes in models for
life history data. Ann. Statist. 18 1259-1294. MR1062708

[22] ISHWARAN, H. and JAMES, L. F. (2001). Gibbs sampling methods for stick-breaking priors.
J. Amer. Statist. Assoc. 96 161-173. MR1952729

[23] ISHWARAN, H. and JAMES, L. F. (2003). Generalized weighted Chinese restaurant processes
for species sampling mixture models. Statist. Sinica 13 1211-1235. MR2026070

[24] JAMES, L. F. (2002). Poisson process partition calculus with applications to exchangeable
models and Bayesian nonparametrics. Available at arxiv.org/abs/math.pr/0205093.

[25] JAMES, L. F. (2003). Poisson calculus for spatial neutral to the right processes. Available at
ihome.ust.hk/~lancelot.

[26] JAMES, L. F. (2005). Bayesian Poisson process partition calculus with an application to
Bayesian Lévy moving averages. Ann. Statist. 33 1771-1799. MR2166562

[27] JAMES, L. F. and KWON, S. (2000). A Bayesian nonparametric approach for the joint
distribution of survival time and mark variables under univariate censoring. Technical report,
Dept. Mathematical Sciences, Johns Hopkins Univ.

[28] KALLENBERG, O. (2002). Foundations of Modern Probability, 2nd ed. Springer, New York.
MR1876169




[29] KIM, Y. (1999). Nonparametric Bayesian estimators for counting processes. Ann. Statist. 27
562-588. MR1714717

[30] KINGMAN, J. F. C. (1993). Poisson Processes. Oxford Univ. Press, New York. MR1207584

[31] LAST, G. and BRANDT, A. (1995). Marked Point Processes on the Real Line: The Dynamic
Approach. Springer, New York. MR1353912

[32] LO, A. Y. (1984). On a class of Bayesian nonparametric estimates. I. Density estimates. Ann.
Statist. 12 351-357. MR0733519

[33] LO, A. Y. (1993). A Bayesian bootstrap for censored data. Ann. Statist. 21 100-123.
MR1212168

[34] PITMAN, J. (1996). Some developments of the Blackwell-MacQueen urn scheme. In Statistics,
Probability and Game Theory (T. S. Ferguson, L. S. Shapley and J. B. MacQueen, eds.)
245-267. IMS, Hayward, CA. MR1481784

[35] PITMAN, J. (1997). Partition structures derived from Brownian motion and stable
subordinators. Bernoulli 3 79-96. MR1466546

[36] PITMAN, J. (1999). Coalescents with multiple collisions. Ann. Probab. 27 1870-1902.
MR1742892

[37] PITMAN, J. (2002). Combinatorial stochastic processes. Technical Report 621, Dept.
Statistics, Univ. California, Berkeley. Available at stat-www.berkeley.edu/users/pitman/bibliog.html.

[38] PITMAN, J. (2003). Poisson-Kingman partitions. In Statistics and Science: A Festschrift for
Terry Speed (D. R. Goldstein, ed.) 1-34. IMS, Beachwood, OH. MR2004330

[39] PITMAN, J. and YOR, M. (1997). The two-parameter Poisson-Dirichlet distribution derived
from a stable subordinator. Ann. Probab. 25 855-900. MR1434129

[40] REGAZZINI, E., LIJOI, A. and PRÜNSTER, I. (2003). Distributional results for means of
normalized random measures with independent increments. Ann. Statist. 31 560-585.
MR1983542

[41] WALKER, S. and MULIERE, P. (1997). Beta-Stacy processes and a generalization of the
Pólya-urn scheme. Ann. Statist. 25 1762-1780. MR1463574


DEPARTMENT OF INFORMATION 
AND SYSTEMS MANAGEMENT 
HONG KONG UNIVERSITY 
OF SCIENCE AND TECHNOLOGY 
CLEAR WATER BAY, KOWLOON 
HONG KONG 
E-MAIL: lancelot@ust.hk


The Annals of Statistics
2006, Vol. 34, No. 1, 441-468
DOI: 10.1214/009053605000000804
© Institute of Mathematical Statistics, 2006


NONSUBJECTIVE PRIORS VIA PREDICTIVE RELATIVE 
ENTROPY REGRET 


BY TREVOR J. SWEETING,¹ GAURI S. DATTA² AND MALAY GHOSH³
University College London, University of Georgia and University of Florida 


We explore the construction of nonsubjective prior distributions in
Bayesian statistics via a posterior predictive relative entropy regret criterion.
We carry out a minimax analysis based on a derived asymptotic predictive
loss function and show that this approach to prior construction has a number
of attractive features. The approach here differs from previous work that uses
either prior or posterior relative entropy regret in that we consider predictive
performance in relation to alternative nondegenerate prior distributions. The
theory is illustrated with an analysis of some specific examples.


1. Introduction. There is an extensive literature on the development of ob- 
jective prior distributions based on information loss criteria. Bernardo [5] obtains 
reference priors by maximizing the Shannon mutual information between the pa- 
rameter and the sample. These priors are maximin solutions under relative entropy 
loss; see, for example, [3, 8] for further analysis, discussion and references. In reg- 
ular parametric families the reference prior for the full parameter is Jeffreys’ prior. 
It is argued in [5], however, that when nuisance parameters are present, then the 
appropriate reference prior should depend on which parameter(s) are deemed to be 
of primary interest. This dependence on parameters of interest is mirrored in the 
approach to prior development via minimization of coverage probability bias; see,
for example, [11, 23, 25] for further aspects of this approach. 

In the present paper we explore the construction of nonsubjective prior distrib- 
utions via predictive performance. It is possible to use Bernardo's approach to ob- 
tain reference priors for prediction. However, as shown in [5], this program turns 
out to be equivalent to obtaining the reference prior for the full parameter, which 
produces Jeffreys' prior in regular problems. Further analysis along these lines 
is carried out in [17]. Datta et al. [12] explore prior construction using predic- 
tive probability matching, which is shown to produce sensible prior distributions 
in a number of standard examples. In the present article we follow Bernardo [5] 


Received March 2003; revised July 2005. 
¹Supported in part by EPSRC Grant GR/R24210/01.
²Supported in part by NSF Grants DMS-00-71642 and SES-02-41651 and NSA Grant MDA904-03-1-0016.
³Supported in part by NSF Grant SES-99-11485.
AMS 2000 subject classifications. Primary 62F15; secondary 62B10, 62C20. 
Key words and phrases. Nonsubjective Bayesian inference, predictive inference, relative entropy 
loss, higher-order asymptotics. 






and Barron [3] by taking an information-theoretic approach and using an entropy- 
based risk function. However, here we focus on the posterior predictive relative 
entropy regret, as opposed to the prior predictive relative entropy regret used by 
these authors. Our starting point is the predictive information criterion introduced 
by Aitchison [1], which was also discussed by Akaike [2] as a criterion for the 
selection of objective priors. We depart from these and other authors by taking a 
more Bayesian viewpoint, in that we are less concerned here with performance 
in repeated sampling but rather with performance in relation to alternative prior 
specifications. The main aim of the paper is to search for uniform, or impartial, 
minimax priors under an associated predictive loss function. These priors are also 
maximin, or least favorable, which can be interpreted here as giving rise to mini- 
mum information predictive distributions. 

The organization of the paper is as follows. We start in Section 2 by defining 
the posterior predictive regret, which measures the regret when using a posterior 
predictive distribution under a particular prior in relation to the posterior predictive 
distribution under an alternative proper prior. We define a related predictive loss 
function and argue that this is a suitable criterion for the comparison of alternative 
prior specifications. We discuss informally the results in Section 6 on impartial, 
minimax and maximin priors under a large sample version of this loss function. We 
also give a definition of the predictive information in a prior distribution. Through- 
out we make connections with standard quantities that arise in information theory. 
In Section 3 we relate posterior predictive regret and loss to prior predictive re- 
gret and loss and in Section 4 we obtain the asymptotic behavior of the posterior 
predictive regret, which is obtained via an analysis of the higher-order asymptotic 
behavior of the prior predictive regret. The higher-order analysis carried out in 
Section 5, which is of independent interest, leads to expressions for the asymptotic
forms of the posterior predictive regret, predictive information and predictive
loss. In Section 6 we investigate impartial minimax priors under our asymptotic
predictive loss function. It turns out that these priors also minimize the asymptotic 
information in the predictive distribution. In the case of a single real parameter, 
Jeffreys' prior turns out to be minimax. However, in dimensions greater than one, 
the minimax solution need not be Jeffreys' prior. The theory is illustrated with 
an analysis of some specific examples, and some concluding remarks are given in 
Section 7. 

There are a number of appealing aspects of the proposed Bayesian predictive 
approach to prior determination. First, since the focus is on prediction, there is no 
need to specify a set of parameters deemed to be of interest. Second, difficulties 
associated with improper priors are avoided in the formulation of posterior predic- 
tive, as opposed to prior predictive, criteria. Third, the minimax priors identified 
in Section 6 arise as limits of proper priors. Fourth, these minimax priors are also 
maximin, or least favorable for prediction, which can be interpreted here as min- 
imizing the predictive information contained in a prior. Finally, and importantly, 
the same asymptotic predictive loss criterion emerges regardless of whether one is 




considering prediction of a single future observation or a large number of future 
observations. 


2. Posterior predictive regret and impartial priors. Consider a parametric
model with density $p(\cdot|\theta)$ with respect to a $\sigma$-finite measure $\mu$, where
$\theta = (\theta^1, \dots, \theta^p)$ is an unknown parameter in an open set $\Theta \subset \mathbb{R}^p$, $p \ge 1$. Let
$p^{\pi}(x) = \int p(x|\theta)\,d\pi(\theta)$ be the marginal density of $X$ under the prior distribution
$\pi$ on $\Theta$, where both $\pi$ and $p^{\pi}$ may be improper. Let $\Pi$ be the class of prior
distributions $\pi$ satisfying $p^{\pi}(X) < \infty$ a.s. $(\theta)$ for all $\theta \in \Theta$. That is, $\pi \in \Pi$ if and only
if $P^{\theta}(\{X : p^{\pi}(X) < \infty\}) = 1$ for all $\theta \in \Theta$.

We suppose that $X$ represents data to be observed and $Y$ represents future
observations to be predicted. Denote by $p^{\pi}(y|x)$ the posterior predictive density of $Y$
given $X = x$ under the prior $\pi \in \Pi$. Let $\Omega \subset \Pi$ be the class of all proper prior
distributions on $\Theta$. For $\pi \in \Pi$ and $\tau \in \Omega$, define the posterior predictive regret

\[
d_{Y|X}(\tau, \pi) = \int\!\!\int \log\left\{\frac{p^{\tau}(y|x)}{p^{\pi}(y|x)}\right\} p^{\tau}(x, y)\,d\mu(x)\,d\mu(y). \tag{2.1}
\]

We note that $d_{Y|X}(\tau, \pi)$ is the conditional relative entropy, or expected
Kullback-Leibler divergence, $D(p^{\tau}(Y|X)\,\|\,p^{\pi}(Y|X))$, between the predictive densities
under $\tau$ and $\pi$. See, for example, the book by Cover and Thomas [10] for
definitions and properties of the various information-theoretic quantities that arise in
this work. It follows from standard results in information theory that the quantity
$d_{Y|X}(\tau, \pi)$ always exists (possibly $+\infty$) and is nonnegative. It is zero when $\pi = \tau$
and is therefore the expected regret under the loss function $-\log p^{\pi}(y|x)$ associated
with using the predictive density $p^{\pi}(y|x)$ when $X$ and $Y$ arise from $p^{\tau}(x)$
and $p^{\tau}(y|x)$, respectively.

When $\tau = \{\theta\}$, the distribution degenerate at $\theta \in \Theta$, we will simply write
$d_{Y|X}(\tau, \pi) = d_{Y|X}(\theta, \pi)$, where

\[
d_{Y|X}(\theta, \pi) = \int\!\!\int \log\left\{\frac{p(y|x, \theta)}{p^{\pi}(y|x)}\right\} p(x, y|\theta)\,d\mu(x)\,d\mu(y) \tag{2.2}
\]

is the expected regret under the loss function $-\log p^{\pi}(y|x)$ associated with using
the predictive density $p^{\pi}(y|x)$ when $X$ and $Y$ arise from $p(x|\theta)$ and $p(y|x, \theta)$,
respectively. The regret (2.2) is the conditional relative entropy
$D(p(Y|X, \theta)\,\|\,p^{\pi}(Y|X))$. The readily derived relationship

\[
\int d_{Y|X}(\theta, \pi)\,d\tau(\theta) = d_{Y|X}(\tau, \pi) + \int d_{Y|X}(\theta, \tau)\,d\tau(\theta) \tag{2.3}
\]

implies that (2.2) is a proper scoring rule, as pointed out by Aitchison [1]; that
is, the left-hand side of (2.3) attains its minimum value over $\pi \in \Pi$ when $\pi = \tau$.
We note that the final integral in (2.3) is the Shannon conditional mutual information
$I(Y; \theta | X)$ between $Y$ and $\theta$ conditional on $X$ (under the prior $\tau$). Conditional
mutual information has been used by Sun and Berger [21] for deriving reference




priors conditional on a parameter to which a subjective prior has been assigned, 
and by Clarke and Yuan [9] for deriving possibly data-dependent "partial informa- 
tion" reference priors that are conditional on a statistic. 

Definition (2.1) of the posterior predictive regret is motivated by standard argu- 
ments for adopting the logarithmic score log q (Y) as an operational utility function 
when using q as a predictive density for the random quantity Y ; see, for example, 
the discussion in Chapter 2 of [6]. The criterion (2.2) was used by Aitchison [1] for 
the purpose of comparing the predictive performance of estimative and posterior 
predictive distributions, which was followed up by Komaki [16], who considered 
the associated asymptotic theory for curved exponential families. Hartigan [14] 
obtained related higher-order asymptotic expressions which he used to compare 
estimative predictive distributions based on (bias-corrected) maximum likelihood 
and Bayes estimators. Akaike [2] discussed the use of (2.2) for the selection of 
objective priors. A similar approach was also proposed by Geisser in his discus- 
sion of Bernardo [5]. Recently, Liang and Barron [19] have derived exact minimax 
priors under the criterion (2.2) for location and scale families. 

The criterion (2.1) extends the domain of definition of (2.2) from degenerate
priors $\{\theta\}$ to all proper priors $\tau \in \Omega$. We argue that (2.1) is a suitable Bayesian
performance characteristic for assessing the predictive performance of a nonsubjective
prior distribution $\pi$ when $\theta$ arises from alternative proper prior distributions
$\tau$. There are two ways of thinking about this. First, we might be interested in
the predictive performance of a proposed nonsubjective prior distribution under its
repeated use, as opposed to its performance under repeated sampling, as measured
by (2.2). From this point of view, we could consider the prior selection problem
as an idealized game between the Statistician and Nature, in which each player
selects a prior distribution. An alternative viewpoint is to consider (2.1) as measuring
the predictive performance of $\pi$ in relation to a subjective prior distribution $\tau$
that is as yet unspecified. Thus, $\tau$ might reflect the prior beliefs, yet to be elicited,
of an expert. In this case the prior selection problem could be viewed as a game
between the Statistician and an Expert. It is possible, of course, that the Statistician
and Expert are the same person, whose prior beliefs have yet to be properly
formulated.

Akaike [2] considered priors that give constant posterior predictive regret (2.2),
referring to such priors as uniform or "impartial" priors. Such priors will only exist
in special cases, however. Achieving constant regret over all possible priors $\tau \in \Omega$
in (2.1) is clearly never possible since, for any fixed $\pi \in \Pi$, the precision of the
predictive distribution under $\tau$ will tend to increase as $\tau$ becomes more informative,
in which case $d_{Y|X}(\tau, \pi)$ will eventually increase. Alternatively, since $\tau$ is
unknown, one might wish to consider the minimaxity of $\pi$ over all $\tau \in \Omega$. However,
the maximum regret will tend to occur at degenerate $\tau$. We would therefore
be led back to the frequentist risk criterion (2.2), which is not the object of primary
interest in the present paper.
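
A simple case in which a prior with constant regret (2.2) does exist can be sketched numerically. The setting below is an assumption chosen for illustration and is not one of the paper's examples: $X_1, \dots, X_n$ and a single future $Y$ are i.i.d. $N(\theta, 1)$, and the prior is either flat or $N(m, v)$. The posterior predictive distribution is then normal, so the regret has a closed form; the function name `normal_regret` is hypothetical.

```python
from math import log


def normal_regret(theta, n, prior_mean=0.0, prior_var=float("inf")):
    """Posterior predictive regret (2.2) for X_1..X_n, Y i.i.d. N(theta, 1)
    under the prior N(prior_mean, prior_var); prior_var = inf is the flat prior.

    The predictive density is N(mu_n, 1 + v_n) with v_n the posterior variance,
    and the regret is the expected KL divergence from N(theta, 1) to it."""
    if prior_var == float("inf"):
        post_var = 1.0 / n
        shrink = 0.0  # no shrinkage toward the prior mean
    else:
        post_var = 1.0 / (n + 1.0 / prior_var)
        shrink = post_var / prior_var
    pred_var = 1.0 + post_var
    # E(theta - mu_n)^2 = squared bias + variance of the posterior mean
    mse = (shrink * (theta - prior_mean)) ** 2 + n * post_var ** 2
    return 0.5 * (log(pred_var) + (1.0 + mse) / pred_var - 1.0)
```

Under the flat prior the regret equals $\tfrac12\log(1 + 1/n)$ for every $\theta$, so the prior is impartial in Akaike's sense, whereas under a proper normal prior the regret grows with the distance between $\theta$ and the prior mean.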




For these reasons, we will study the loss function

\[
L_{Y|X}(\tau, \pi; \pi^{B}) = d_{Y|X}(\tau, \pi) - d_{Y|X}(\tau, \pi^{B}), \tag{2.4}
\]

provided that this exists (see later), which is the posterior predictive regret associated
with using the prior $\pi$ compared to using a fixed base prior $\pi^{B} \in \Pi$. Since
we will be investigating default priors for prediction, it is necessary that our procedure
for choosing the base measure $\pi^{B}$ is such that $p^{B}(y|x)$ does not depend
on the particular parameterization of the model that is adopted. We are therefore
inevitably led to a choice of base measure that is invariant under arbitrary
reparameterization. In the case of a regular parametric family, an obvious candidate for
$\pi^{B}$ is Jeffreys' invariant prior with density proportional to $|I(\theta)|^{1/2}$, where $I(\theta)$
is Fisher's information in the sample $X$. Since we will only be considering regular
likelihoods in the rest of this paper, we take $\pi^{B} = \pi^{J}$ in the sequel and simply
write $L_{Y|X}(\tau, \pi; \pi^{J}) = L_{Y|X}(\tau, \pi)$.

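Assume that the base Jeffreys' prior $\pi^{J}$ satisfies $d_{Y|X}(\theta, \pi^{J}) < \infty$ for all $\theta \in \Theta$
and let $p^{J}(y|x)$ be the conditional density of $Y$ given $X$ under $\pi^{J}$. Then the
(posterior) predictive loss function defined by

\[
L_{Y|X}(\theta, \pi) = d_{Y|X}(\theta, \pi) - d_{Y|X}(\theta, \pi^{J})
 = \int\!\!\int \log\left\{\frac{p^{J}(y|x)}{p^{\pi}(y|x)}\right\} p(x, y|\theta)\,d\mu(x)\,d\mu(y) \tag{2.5}
\]

is well defined, although possibly $+\infty$. Now let $\Omega_{Y|X} \subset \Omega$ be the class of proper
priors $\tau$ for which $\int d_{Y|X}(\theta, \pi^{J})\,d\tau(\theta) < \infty$. Then for $\pi \in \Pi$ and $\tau \in \Omega_{Y|X}$, we
can define the expected predictive loss

\[
L_{Y|X}(\tau, \pi) = \int L_{Y|X}(\theta, \pi)\,d\tau(\theta)
 = \int d_{Y|X}(\theta, \pi)\,d\tau(\theta) - \int d_{Y|X}(\theta, \pi^{J})\,d\tau(\theta)
 = d_{Y|X}(\tau, \pi) - d_{Y|X}(\tau, \pi^{J}), \tag{2.6}
\]

as in (2.4). Since $\tau \in \Omega_{Y|X}$, the final line is well defined (possibly $+\infty$).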
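
The identity in (2.6) can be checked numerically in a small example. The sketch below assumes a Bernoulli model that is not taken from the paper: $X$ consists of $n$ Bernoulli($\theta$) trials, $Y$ is one further trial, $\pi$ is a Beta($a$, $b$) prior, Jeffreys' prior is Beta(1/2, 1/2), and $\tau$ is a discrete proper prior on a few points of $(0, 1)$. All function names are illustrative.

```python
from math import comb, log


def binom_pmf(s, n, theta):
    """P(S = s) for S ~ Binomial(n, theta), the sufficient statistic of X."""
    return comb(n, s) * theta ** s * (1.0 - theta) ** (n - s)


def kl_bernoulli(p, q):
    """KL divergence from Bernoulli(p) to Bernoulli(q), with 0 < p, q < 1."""
    return p * log(p / q) + (1.0 - p) * log((1.0 - p) / (1.0 - q))


def beta_predictive(s, n, a, b):
    """Posterior predictive P(Y = 1 | S = s) under a Beta(a, b) prior."""
    return (a + s) / (a + b + n)


def regret_at_theta(theta, n, a, b):
    """d_{Y|X}(theta, pi) of (2.2) for pi = Beta(a, b)."""
    return sum(binom_pmf(s, n, theta)
               * kl_bernoulli(theta, beta_predictive(s, n, a, b))
               for s in range(n + 1))


def regret_under_tau(points, weights, n, a, b):
    """d_{Y|X}(tau, pi) of (2.1) for a discrete prior tau on `points`."""
    total = 0.0
    for s in range(n + 1):
        joint = [w * binom_pmf(s, n, t) for t, w in zip(points, weights)]
        p_s = sum(joint)                                   # p^tau(s)
        p_y1 = sum(j * t for j, t in zip(joint, points)) / p_s  # p^tau(y=1|s)
        total += p_s * kl_bernoulli(p_y1, beta_predictive(s, n, a, b))
    return total


def expected_predictive_loss(points, weights, n, a, b):
    """First line of (2.6): the tau-average of L(theta, pi) from (2.5),
    with Jeffreys' prior Beta(1/2, 1/2) as the base measure."""
    return sum(w * (regret_at_theta(t, n, a, b)
                    - regret_at_theta(t, n, 0.5, 0.5))
               for t, w in zip(points, weights))
```

With, say, `points = [0.2, 0.5, 0.9]` and `weights = [0.3, 0.3, 0.4]`, the value of `expected_predictive_loss` agrees, up to rounding, with the difference of the two `regret_under_tau` values for $\pi$ and for Jeffreys' prior, as the last line of (2.6) requires.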
Next we define, for $\tau \in \Omega$,

\[
\zeta_{Y|X}(\tau) = d_{Y|X}(\tau, \pi^{J}) = \int\!\!\int \log\left\{\frac{p^{\tau}(y|x)}{p^{J}(y|x)}\right\} p^{\tau}(x, y)\,d\mu(x)\,d\mu(y). \tag{2.7}
\]

Since the negative conditional relative entropy $-d_{Y|X}(\tau, \pi^{J}) = -D(p^{\tau}(Y|X)\,\|\,p^{J}(Y|X))$
is a natural information-theoretic measure of the uncertainty
in the predictive distribution $p^{\tau}(Y|X)$, we will refer to $\zeta_{Y|X}(\tau)$ as the
predictive information in $\tau$. Here $p^{J}(y|x)$ acts as a normalization of the
conditional entropy of $p^{\tau}(y|x)$. From relation (2.3) with $\pi = \pi^{J}$, we see that
$\zeta_{Y|X}(\tau) \le \int d_{Y|X}(\theta, \pi^{J})\,d\tau(\theta)$, from which it follows that $\sup_{\tau \in \Omega} \zeta_{Y|X}(\tau) =$




$\sup_{\theta \in \Theta} \zeta_{Y|X}(\{\theta\})$. That is, the maximum predictive information occurs at (or
near) a degenerate prior. Thus, $\zeta_{Y|X}(\tau)$ is a natural entropy-based measure of the
information in the predictive distribution $p^{\tau}(y|x)$. Note that, again from (2.3),
$\zeta_{Y|X}(\tau) < \infty$ whenever $\tau \in \Omega_{Y|X}$.

It now follows from (2.3), (2.6) and (2.7) that, for $\pi \in \Pi$ and $\tau \in \Omega_{Y|X}$, we can
write

\[
d_{Y|X}(\tau, \pi) = L_{Y|X}(\tau, \pi) + \zeta_{Y|X}(\tau). \tag{2.8}
\]

We will explore priors for which $L_{Y|X}(\theta, \pi)$ is approximately constant in $\theta \in \Theta$.
Notice that if $L_{Y|X}(\theta, \pi)$ is approximately constant, then, from (2.8), $d_{Y|X}(\tau, \pi)$ is
approximately constant over all $\tau$ having the same predictive information $\zeta_{Y|X}(\tau)$.
This therefore provides a suitable notion of approximate uniformity of the posterior
predictive regret (2.1).

In Sections 4 and 5 we will derive large sample forms, $L(\theta, \pi)$, $L(\tau, \pi)$, $\zeta(\tau)$
and $d(\tau, \pi)$, respectively, of suitably normalized versions of $L_{Y|X}(\theta, \pi)$,
$L_{Y|X}(\tau, \pi)$, $\zeta_{Y|X}(\tau)$ and $d_{Y|X}(\tau, \pi)$ and simply refer to $L(\theta, \pi)$ as the predictive
loss function. Importantly, for smooth priors $\pi$ this asymptotic loss function will
not depend on the amount of prediction $Y$ to be carried out. In Section 6 we will
investigate uniform and minimax priors under predictive loss. As is often the case
in game theory, there is a strong relationship between constant loss, minimax and
maximin priors. We give an informal statement of Theorem 6.1. An equalizer prior
is a prior $\pi$ for which the predictive loss function $L(\theta, \pi)$ is constant over $\theta \in \Theta$.
Suppose that $\pi_0$ is an equalizer prior and that there exists a sequence $\tau_k$ of proper
priors in the class $\Phi \subset \Omega$, to be defined in Section 4, for which $d(\tau_k, \pi_0) \to 0$ as
$k \to \infty$. Then Theorem 6.1 states that $\pi_0$ is minimax with respect to $L(\tau, \pi)$ and
$\zeta(\pi_0) = \inf_{\tau \in \Phi} \zeta(\tau)$; that is, $\pi_0$ contains minimum predictive information about $Y$.
This latter property is equivalent to $\pi_0$ being maximin, or least favorable, under
$L(\tau, \pi)$. Since by construction $L(\tau, \pi^{J}) = 0$ for all $\tau \in \Phi$, $\pi^{J}$ is automatically an
equalizer prior. However, there may not exist a sequence $\tau_k$ of proper priors with
$d(\tau_k, \pi^{J}) \to 0$, in which case Jeffreys' prior may not be minimax. Some examples
will be given in Section 6.

Although the focus of this paper is on the general asymptotic form of the predictive
loss, we briefly note the implications of adopting either the posterior predictive
regret (2.2) or the predictive loss (2.5) in the special case where the family $p(\cdot|\theta)$
of densities is invariant under a suitable group $\mathcal{G}$ of transformations of the sample
space. See, for example, Chapter 6 in [4] for a general discussion of invariant
decision problems. Let $\bar{\mathcal{G}}$ be the induced group of transformations on $\Theta$. Then the
predictive loss (2.5) is invariant under $\bar{\mathcal{G}}$ and the invariant decisions are invariant
priors satisfying $\pi(\bar{g}(\theta)) \propto \pi(\theta)\,|d\theta/d\bar{g}(\theta)|$ for all $\bar{g} \in \bar{\mathcal{G}}$. If the group $\bar{\mathcal{G}}$ is
transitive, then the predictive loss is constant for every invariant prior. Furthermore,
if we consider the broader decision problem in which we replace $p^{\pi}(\cdot|x)$ by the
arbitrary decision function $\delta(x) = q_x$, where $q_x(\cdot)$ is to be used as a predictive

PREDICTIVE RELATIVE ENTROPY REGRET 447 


density for $Y$ when $X = x$, then it can be shown that $p^{*}(y|x)$, the posterior
predictive density under the right Haar measure on $\bar{G}$, is the best invariant predictive
density under the posterior predictive regret (2.2). Since $\pi^J$ is an invariant prior,
it further follows that the right Haar measure is the best invariant prior under the
predictive loss function (2.5). Since submission of the final version of the present
paper, a careful analysis using (2.2) for location and scale families has appeared
in [19].

Returning to the definition of the predictive loss function (2.4) relative to an
arbitrary base measure $\pi^B$, we see that this is related to the expected predictive
loss (2.6) by the equation

$$L_{Y|X}(\tau, \pi; \pi^B) = L_{Y|X}(\tau, \pi) - L_{Y|X}(\tau, \pi^B).$$

Therefore, using $\pi^B$ will give rise to an equivalent predictive loss function if and
only if $L_{Y|X}(\theta, \pi^B)$ is constant in $\theta$. In this case we say that $\pi^B$ is neutral relative
to $\pi^J$.


3. Relationship to prior predictive regret. In this section we relate the posterior
predictive regret (2.2) and loss function (2.5) to the prior predictive regret
and loss function. We will use these relationships in Section 4 to obtain the asymptotic
posterior predictive regret $d(\tau, \pi)$ and loss $L(\tau, \pi)$.

For $\pi \in \Pi$, we define the prior predictive regret by

$$d_X(\theta, \pi) = D\big(p(X|\theta)\,\|\,p^{\pi}(X)\big) = \int \log\left\{\frac{p(x|\theta)}{p^{\pi}(x)}\right\} p(x|\theta)\,d\mu(x), \tag{3.1}$$

which is the relative entropy $D(p(X|\theta)\,\|\,p^{\pi}(X))$ between $p(x|\theta)$ and the prior
predictive density $p^{\pi}(x)$. Note that $\pi$ may be improper in this definition. In that
case, unlike the posterior predictive regret, alternative normalizing constants will
give rise to alternative versions of (3.1), differing by constants. The prior predictive
regret (3.1) is the focus of work by Bernardo [5], Clarke and Barron [7] and others.
Now define $\Pi_X \subset \Pi$ to be the class of priors $\pi$ in $\Pi$ for which $d_X(\theta, \pi) < \infty$ for
all $\theta \in \Theta$. If $\pi^J \in \Pi_X$, then for $\pi \in \Pi$ we define the prior predictive loss by

$$L_X(\theta, \pi) = d_X(\theta, \pi) - d_X(\theta, \pi^J) = \int \log\left\{\frac{p^{J}(x)}{p^{\pi}(x)}\right\} p(x|\theta)\,d\mu(x), \tag{3.2}$$

which is well defined (possibly $+\infty$).

The posterior predictive regret (2.2) and loss (2.5) are simply related to the prior 
predictive regret (3.1) and loss (3.2). The following result is essentially the chain 
rule for relative entropy. However, we formally state and prove it since, first, the 
distribution of X may be improper here and, second, we need to make sure that 
these relationships are well defined. 


448 T. J. SWEETING, G. S. DATTA AND M. GHOSH 


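LEMMA 3.1. Suppose that $\pi \in \Pi_{X,Y}$. Then $\pi \in \Pi_X$, $d_{Y|X}(\theta, \pi) < \infty$ for all
$\theta \in \Theta$ and

$$d_{Y|X}(\theta, \pi) = d_{X,Y}(\theta, \pi) - d_X(\theta, \pi). \tag{3.3}$$

If further $\pi^J \in \Pi_{X,Y}$, then $L_{Y|X}(\theta, \pi) < \infty$ for all $\theta \in \Theta$ and

$$L_{Y|X}(\theta, \pi) = L_{X,Y}(\theta, \pi) - L_X(\theta, \pi). \tag{3.4}$$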


PROOF. Since $\pi \in \Pi$, the marginal densities $p^{\pi}(X)$ and $p^{\pi}(X, Y)$ are a.s. ($\theta$)
finite for all $\theta \in \Theta$. Therefore,

$$p^{\pi}(x, y) = \int p(x, y|\phi)\,d\pi(\phi) = p^{\pi}(x) \int p(y|x, \phi)\,dp^{\pi}(\phi|x) = p^{\pi}(x)\,p^{\pi}(y|x),$$

since, by definition, $p(x|\phi)\,d\pi(\phi) = p^{\pi}(x)\,dp^{\pi}(\phi|x)$. It now follows straightforwardly
from the definitions (2.2) and (3.1) that

$$d_{X,Y}(\theta, \pi) = d_{Y|X}(\theta, \pi) + d_X(\theta, \pi). \tag{3.5}$$

Since $\pi \in \Pi_{X,Y}$, it follows from (3.5) that both $d_{Y|X}(\theta, \pi) < \infty$ and $\pi \in \Pi_X$ and,
hence, relation (3.3) holds. Since $\pi \in \Pi_X$ and $\pi^J \in \Pi_X$, it follows from (3.2)
that $L_{Y|X}(\theta, \pi)$ is finite for all $\theta$. Finally, since $\pi^J \in \Pi$, we have $p^{J}(x, y) =
p^{J}(x)\,p^{J}(y|x)$ and relation (3.4) follows straightforwardly from the definitions
(2.5) and (3.2).

Finally, let $\Omega_X \subset \Omega$ be the class of priors $\tau$ in $\Omega$ satisfying $\int d_X(\theta, \pi^J)\,d\tau(\theta) < \infty$.
It follows from equation (3.3) of Lemma 3.1 that $\pi^J \in \Pi_{X,Y}$
and $\tau \in \Omega_{X,Y}$ imply that $\int d_{Y|X}(\theta, \pi^J)\,d\tau(\theta) < \infty$, $\tau \in \Omega_X$ and

$$\int d_{Y|X}(\theta, \pi^J)\,d\tau(\theta) = \int d_{X,Y}(\theta, \pi^J)\,d\tau(\theta) - \int d_X(\theta, \pi^J)\,d\tau(\theta).$$

Therefore, if $\pi \in \Pi_{X,Y}$ and $\tau \in \Omega_{X,Y}$, then the expected posterior loss $L_{Y|X}(\tau, \pi)$
at (2.6) is well defined. □
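The chain-rule relation (3.3) can be checked numerically in a toy setting. The sketch below is illustrative only: it assumes a single Bernoulli observation X and a single future observation Y, a hypothetical two-point proper prior, and it takes the posterior predictive regret (2.2) to be the expected log ratio of p(y|θ) to the posterior predictive mass of y given x, which is an assumed reading since (2.2) is not restated in this section.

```python
import math

def bern(x, theta):
    """Bernoulli probability mass p(x | theta) for x in {0, 1}."""
    return theta if x == 1 else 1.0 - theta

# Hypothetical two-point proper prior on theta (illustration only).
PRIOR = {0.2: 0.3, 0.7: 0.7}

def marginal(xs):
    """Prior predictive mass of a tuple of i.i.d. Bernoulli outcomes."""
    return sum(w * math.prod(bern(x, t) for x in xs) for t, w in PRIOR.items())

def regret_x(theta):
    """d_X: relative entropy between p(x | theta) and the prior predictive mass."""
    return sum(bern(x, theta) * math.log(bern(x, theta) / marginal((x,)))
               for x in (0, 1))

def regret_xy(theta):
    """d_{X,Y}: the same relative entropy for the pair (X, Y)."""
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p = bern(x, theta) * bern(y, theta)
            total += p * math.log(p / marginal((x, y)))
    return total

def regret_y_given_x(theta):
    """d_{Y|X}: expected log ratio of p(y | theta) to the posterior
    predictive mass of y given x (assumed form of (2.2))."""
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            post_pred = marginal((x, y)) / marginal((x,))
            total += (bern(x, theta) * bern(y, theta)
                      * math.log(bern(y, theta) / post_pred))
    return total

for theta in (0.1, 0.5, 0.9):
    gap = regret_xy(theta) - regret_y_given_x(theta) - regret_x(theta)
    print(theta, abs(gap) < 1e-12)
```

For each value of θ the difference between the joint regret and the sum of the other two should vanish up to rounding, which is the content of (3.5).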


4. Asymptotic behavior of the predictive loss. Throughout the remainder
of this article we specialize to the case $X = (X_1, \ldots, X_n)$ and $Y = (X_{n+1}, \ldots,
X_{n+m})$, where the $X_i$ are independent observations from a density $f(x|\theta)$ with
respect to a measure $\mu$. In the present section we investigate the asymptotic
behavior as $n \to \infty$ of the predictive loss function (2.5). In particular, we will show
that, under suitable regularity conditions, the asymptotic form of (2.5) (after suitable
normalization) is the same regardless of the amount $m$ of prediction to be
performed. This leads to a general definition for broad classes of priors $\pi$ and $\tau$ of
the (asymptotic) predictive loss $L(\tau, \pi)$, information $\zeta(\tau)$ and regret $d(\tau, \pi)$.

For an asymptotic analysis of the posterior predictive regret (2.2) and loss function
(2.5), from (3.2), (3.3) and (3.4), we see that it suffices to study the asymptotic
behavior of the prior predictive regret $d_X(\theta, \pi)$. Suppose that $\pi \in \Pi$
has a density with respect to Lebesgue measure. For notational convenience,




in what follows we will use the same symbol $\pi$ to denote this density. Let
$l(\theta) = n^{-1} \log p(X|\theta) = n^{-1} \sum_i \log f(X_i|\theta)$ be the normalized loglikelihood
function and let $i(\theta) = E^{\theta}\{-l''(\theta)\} = n^{-1} I(\theta)$ be Fisher's information per observation.
A standard result for the prior predictive regret (3.1) when $\pi$ is a density
(see, e.g., [7]) is that, under suitable regularity conditions,

$$d_X(\theta, \pi) = \frac{p}{2} \log\left(\frac{n}{2\pi e}\right) + \log\left\{\frac{|i(\theta)|^{1/2}}{\pi(\theta)}\right\} + o(1) \tag{4.1}$$

as $n \to \infty$. [Here the $\pi$ appearing in the first term on the right-hand side of (4.1) is
the usual transcendental number and should not be confused with the prior $\pi(\cdot)$.]
Taking Jeffreys' prior to be $\pi^J(\theta) = |i(\theta)|^{1/2}$, it follows from (3.2) and (4.1) that
the prior predictive loss satisfies

$$L_X(\theta, \pi) = \log\left\{\frac{|i(\theta)|^{1/2}}{\pi(\theta)}\right\} + o(1).$$
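The leading terms in (4.1) can be compared with an exact calculation in a conjugate setting. The following sketch is an illustration under stated assumptions, not part of the original argument: it takes X_1, ..., X_n to be N(θ, 1) with a hypothetical N(0, s2) prior, for which the prior predictive distribution is multivariate normal and the relative entropy (3.1) has a closed form. The function names and the choice of prior are illustrative.

```python
import math

def exact_regret(theta, n, s2):
    """Exact d_X(theta, pi) for X_1..X_n ~ N(theta, 1) and a N(0, s2) prior.

    The marginal of X is N(0, I + s2 * 11'), so the regret is a Gaussian
    relative entropy; v is the determinant of the marginal covariance.
    """
    v = 1.0 + n * s2
    return 0.5 * (math.log(v) - n * s2 / v + n * theta ** 2 / v)

def leading_terms(theta, n, s2):
    """Right-hand side of (4.1) without the o(1) term, for p = 1, i(theta) = 1."""
    log_prior = -0.5 * math.log(2.0 * math.pi * s2) - theta ** 2 / (2.0 * s2)
    return 0.5 * math.log(n / (2.0 * math.pi * math.e)) - log_prior

for n in (10, 100, 1000, 10000):
    print(n, exact_regret(1.0, n, 4.0) - leading_terms(1.0, n, 4.0))
```

The printed gaps should shrink roughly like 1/n, which is the behavior that motivates developing the expansion (4.1) further.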

It now follows from (3.4) that, for any sequence $m = m_n \ge 1$, $L_{Y|X}(\theta, \pi) =
o(1)$; that is, to first order the posterior predictive loss is identically zero for every
smooth prior $\pi$. It is therefore necessary to develop further the asymptotic expansion
in (4.1). Let $\hat\theta$ denote the maximum likelihood estimator based on the data $X$
and assume that the observed information matrix $J = -n l''(\hat\theta)$ is positive definite
over the set $S$ for which $P^{\theta}(S) = 1 + o(n^{-1})$, uniformly in compact subsets of $\Theta$.

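Let $\Pi_\infty$ be the class of priors $\pi \in \Pi$ for which $\pi \in \Pi_X$ for all $n$ and let
$C \subset \Pi_\infty$ be the class of priors in $\Pi_\infty$ that possess densities having continuous
second-order derivatives throughout $\Theta$. Then, under suitable additional regularity
conditions on $f$ and $\pi \in C$ to be discussed in Section 5, the marginal density of $X$
is

$$p^{\pi}(x) = (2\pi s_B^2)^{p/2}\,|J|^{-1/2}\,p(x|\hat\theta)\,\pi(\hat\theta)\,\{1 + o(n^{-1})\},$$

where $s_B^2 = (1 + b_B)^2$ is a Bayesian Bartlett correction, with $b_B = O(n^{-1})$; see,
for example, [22]. Therefore, we can write

$$\log\left\{\frac{p(x|\theta)}{p^{\pi}(x)}\right\} = \frac{p}{2}\log\left(\frac{n}{2\pi}\right) + \log\left\{\frac{|i(\theta)|^{1/2}}{\pi(\theta)}\right\} - n\{l(\hat\theta) - l(\theta)\} - \frac{p}{2}\log s_B^2 - \log\left\{\frac{\pi(\hat\theta)}{\pi(\theta)}\right\} + \frac{1}{2}\log\left\{\frac{|J|}{|I(\theta)|}\right\} + o\left(\frac{1}{n}\right).$$

Since $E^{\theta}[n\{l(\hat\theta) - l(\theta)\}] = p\,s_F^2(\theta)/2 + o(n^{-1})$, where $s_F^2(\theta) = \{1 + b_F(\theta)\}^2$ is a
frequentist Bartlett correction, with $b_F(\theta) = O(n^{-1})$, it follows from (3.1) that

$$d_X(\theta, \pi) = \frac{p}{2}\log\left(\frac{n}{2\pi e}\right) + \log\left\{\frac{|i(\theta)|^{1/2}}{\pi(\theta)}\right\} - h_n(\theta, \pi), \tag{4.2}$$

where

$$h_n(\theta, \pi) = p\{E^{\theta}(b_B) + b_F(\theta)\} + E^{\theta}\left[\log\left\{\frac{\pi(\hat\theta)}{\pi(\theta)}\right\}\right] - \frac{1}{2}E^{\theta}\left[\log\left\{\frac{|J|}{|I(\theta)|}\right\}\right]. \tag{4.3}$$

Under suitable regularity conditions, the leading term in (4.3) turns out to
be $O(n^{-1})$, since both the Bayesian and frequentist Bartlett corrections
are $O(n^{-1})$, as are all the expectations on the right-hand side of (4.3). We will
therefore suppose that $h_n$ is of the form

$$h_n(\theta, \pi) = \frac{D(\theta, \pi)}{2n} + r_n(\theta, \pi), \tag{4.4}$$

where $D(\theta, \pi)$ is continuous in $\theta$ and the remainder term $r_n(\theta, \pi)$ satisfies one of
the following three successively stronger conditions:

R1. $r_n(\theta, \pi) = o(n^{-1})$ uniformly in compacts of $\Theta$;

R2. $r_n(\theta, \pi) = O(n^{-2})$ uniformly in compacts of $\Theta$;

R3. $r_n(\theta, \pi) = E(\theta, \pi)\,n^{-2} + o(n^{-2})$ uniformly in compacts of $\Theta$, where $E(\theta, \pi)$
is continuous in $\theta$.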


The above three forms of remainder require successively stronger assumptions
about both the likelihood $p(\cdot|\theta)$ and the prior $\pi(\theta)$. Suitable sets of regularity
conditions for the validity of (4.4) will be discussed in Section 5. In particular,
$\pi \in C$ is a sufficient condition on the prior for the weakest form R1 of remainder.
The form of $D(\theta, \pi)$ for $\pi \in C$ will be derived in Section 5.

Throughout the remainder of the paper we assume that $\pi^J \in C$ and define, for
all $\pi \in C$,

$$L(\theta, \pi) = D(\theta, \pi) - D(\theta, \pi^J). \tag{4.5}$$

We note that $L(\theta, \pi)$ is well defined when $\pi$ is improper since the arbitrary
normalizing constant in $\pi$ does not appear in $D(\theta, \pi)$. We will study the asymptotic
behavior of the posterior predictive loss (2.5) as $n \to \infty$ for an arbitrary number
$m_n \ge 1$ of predictions $Y_i$. Let $c_n = 2n(n + m_n)/m_n$. The next theorem gives
conditions under which

$$c_n L_{Y|X}(\theta, \pi) \to L(\theta, \pi) \tag{4.6}$$

uniformly in compacts of $\Theta$ under each of the forms R1-R3 of remainder.


THEOREM 4.1.

(a) Suppose that R1 holds. Then (4.6) holds whenever $\liminf_{n \to \infty} m_n/n > 0$.
(b) Suppose that R2 holds. Then (4.6) holds whenever $m_n \to \infty$.
(c) Suppose that R3 holds. Then (4.6) holds for every sequence $(m_n)$ of positive
integers.


PROOF. First note that (3.2), (4.2), (4.4) and (4.5) give, on taking $\pi^J(\theta) =
|i(\theta)|^{1/2}$,

$$L_X(\theta, \pi) = \log\left\{\frac{\pi^J(\theta)}{\pi(\theta)}\right\} - \frac{L(\theta, \pi)}{2n} - \bar{r}_n(\theta, \pi), \tag{4.7}$$

where $\bar{r}_n(\theta, \pi) = r_n(\theta, \pi) - r_n(\theta, \pi^J)$. Also note that, since $\pi \in \Pi_\infty$, Lemma 3.1
applies for all $n$.

(a) From (3.4), (4.7) and R1, we have $L_{Y|X}(\theta, \pi) = c_n^{-1} L(\theta, \pi) + o(n^{-1})$
and (4.6) follows since $n^{-1} c_n = 2(m_n^{-1} n + 1)$ and $\limsup_{n \to \infty} m_n^{-1} n < \infty$.

(b) From (3.4), (4.7) and R2, we have $L_{Y|X}(\theta, \pi) = c_n^{-1} L(\theta, \pi) + O(n^{-2})$
and (4.6) follows since $n^{-2} c_n = 2(m_n^{-1} + n^{-1}) \to 0$.

(c) From (3.4), (4.7) and R3, we have $L_{Y|X}(\theta, \pi) = c_n^{-1}\{L(\theta, \pi) +
d_n^{-1} \bar{E}(\theta, \pi)\} + o(n^{-2})$, where $d_n = \{2(2n + m_n)\}^{-1} n(n + m_n)$ and
$\bar{E}(\theta, \pi) = E(\theta, \pi) - E(\theta, \pi^J)$. (4.6) follows since $d_n^{-1} = O(n^{-1})$ and $n^{-2} c_n =
2(m_n^{-1} + n^{-1})$ is bounded. □
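The algebra behind the three parts of the proof can be verified with exact rational arithmetic. The sketch below is only a check of the stated identities: it assumes c_n = 2n(n + m)/m as defined before Theorem 4.1 and reads d_n as n(n + m)/{2(2n + m)}; the function names are illustrative and not from the paper.

```python
from fractions import Fraction

def c_n(n, m):
    """Normalizing constant c_n = 2n(n + m)/m."""
    return Fraction(2 * n * (n + m), m)

def d_n(n, m):
    """Constant d_n = n(n + m) / {2(2n + m)} used in part (c)."""
    return Fraction(n * (n + m), 2 * (2 * n + m))

def check_identities(n, m):
    """Return True if the algebraic facts used in the proof hold exactly."""
    c, d = c_n(n, m), d_n(n, m)
    # Differencing the 1/(2n) terms of (4.7) at sample sizes n and n + m.
    first_order = Fraction(1, 2 * n) - Fraction(1, 2 * (n + m)) == 1 / c
    # Differencing the n^{-2} terms under R3.
    second_order = Fraction(1, n ** 2) - Fraction(1, (n + m) ** 2) == 1 / (c * d)
    # Growth rates of c_n quoted in parts (a) and (b)/(c).
    rate_a = c / n == 2 * (Fraction(n, m) + 1)
    rate_b = c / n ** 2 == 2 * (Fraction(1, m) + Fraction(1, n))
    return first_order and second_order and rate_a and rate_b

print(all(check_identities(n, m) for n in range(1, 30) for m in range(1, 30)))
```

The first identity shows why multiplying the difference of the two prior predictive losses by c_n isolates L(θ, π); the second shows where the factor d_n in part (c) comes from.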


Theorem 4.1 tells us that, although the predictive loss function (2.5) covers an 
infinite variety of possibilities for the amount of data to be observed and predic- 
tions to be made, it is approximately equivalent to the single loss function (4.5), 
provided that a sufficient amount of data X is to be observed. Although this is not 
surprising given the form of (4.7) and the relation (3.4), it considerably simplifies 
the task of assessing the predictive risk arising from using alternative priors. We 
will refer to $L(\theta, \pi)$ as the (asymptotic) predictive loss function. A special case of
interest arises when $m_n = n$, which corresponds to prediction of a replicate data
set of the same size as that to be observed. Note that in this case (4.6) holds under 
the weakest condition R1. More generally, Laud and Ibrahim [18] refer to the pos- 
terior predictive density of Y in this case as the "predictive density of a replicate 
experiment," which they study in relation to model choice. 

Now let $\Omega_\infty$ be the class of priors $\tau \in \Omega$ for which $\tau \in \Omega_X$ for all $n$. Although
the expected predictive loss $L_{Y|X}(\tau, \pi)$ is well defined (possibly $+\infty$)
when $\pi \in \Pi_\infty$ and $\tau \in \Omega_\infty$, in general, the expected asymptotic predictive loss
$\int L(\theta, \pi)\,d\tau(\theta)$ may not exist, and when it does, additional conditions will be
needed for it to be the limit of the expected loss $c_n L_{Y|X}(\tau, \pi)$. In order to retain
generality, we will extend the domain of definition of the asymptotic predictive
loss (4.5) so that it is defined for all $\pi \in \Pi_\infty$ and $\tau \in \Omega_\infty$. Thus, for
$\pi \in \Pi_\infty$, $\tau \in \Omega_\infty$ and a given sequence $(m_n)$ of positive integers, we define the
(asymptotic) predictive loss to be

$$L(\tau, \pi) = \limsup_{n \to \infty} c_n L_{Y|X}(\tau, \pi), \tag{4.8}$$
n— oo 




which always exists (possibly $+\infty$). Thus, $L(\tau, \pi)$ represents the asymptotically
worst-case predictive loss when the prior $\pi$ is used in relation to the alternative
proper prior $\tau$. Since the degenerate prior $\tau = \{\theta\}$ is in $\Omega_\infty$, (4.8) also provides
a definition of $L(\theta, \pi)$ for all $\pi \in \Pi_\infty$, $\theta \in \Theta$, which agrees with (4.5) whenever
$\pi \in C \subset \Pi_\infty$ and one of the conditions R1-R3 holds.

Now define the (asymptotic) predictive information contained in $\tau \in \Omega_\infty \cap \Pi_\infty$
to be

$$\zeta(\tau) = -L(\tau, \tau) = \liminf_{n \to \infty} c_n \zeta_{Y|X}(\tau) \tag{4.9}$$

and let $\Phi \subset \Omega_\infty \cap \Pi_\infty$ be the class of $\tau$ for which $\zeta(\tau) < \infty$. Finally, for $\pi \in \Pi_\infty$
and $\tau \in \Phi$, define

$$d(\tau, \pi) = L(\tau, \pi) + \zeta(\tau), \tag{4.10}$$

which is the asymptotic form of equation (2.8). The next lemma implies that the
predictive loss function (4.8) is a $\Phi$-proper scoring rule and that $d(\tau, \pi)$ is the
regret associated with $L(\tau, \pi)$.


LEMMA 4.1. For all $\tau \in \Phi$,

$$\inf_{\pi \in \Pi_\infty} L(\tau, \pi) = L(\tau, \tau) = -\zeta(\tau).$$


PROOF. Let $\tau \in \Phi$. By construction, $d(\tau, \tau) = 0$, so we only need to show
that $d(\tau, \pi) \ge 0$ for all $\pi \in \Pi_\infty$. Since $\pi \in \Pi_\infty$ and $\tau \in \Omega_\infty \cap \Pi_\infty$, we have
$\pi \in \Pi_{X,Y}$ and $\tau \in \Omega_{X,Y} \cap \Pi_{X,Y}$ for all $n$ and, hence, the quantities $L_{Y|X}(\tau, \pi)$ and
$L_{Y|X}(\tau, \tau)$ are both well defined. But $L_{Y|X}(\tau, \tau) \le L_{Y|X}(\tau, \pi)$ and multiplying
both sides of this inequality by $c_n$ and taking the $\limsup_{n \to \infty}$ on both sides of the
resulting inequality gives $L(\tau, \tau) \le L(\tau, \pi)$. The result follows from the definition
of $d(\tau, \pi)$. □


When $\pi \in C$, $L(\theta, \pi)$ is independent of the sequence $(m_n)$. In general, however,
both $L(\tau, \pi)$ and $\zeta(\tau)$ may depend on the particular sequence $(m_n)$, although we
have suppressed this dependence in the notation. Nevertheless, the minimax results
of Section 6 will be independent of $(m_n)$.


5. Derivation of the asymptotic predictive loss function. In this section we
obtain the form of the function $D(\theta, \pi)$ arising in the $O(n^{-1})$ term in the asymptotic
expansion of the prior predictive regret $d_X(\theta, \pi)$. This then leads to an
expression for the asymptotic predictive loss function $L(\theta, \pi)$ for all $\pi \in C$ via
relation (4.5). The computations involved in the determination of $D(\theta, \pi)$, which
are similar in nature to computations in [14], are technically quite demanding. Finally,
we deduce expressions for the asymptotic posterior predictive regret (4.10)
and predictive information (4.9) under certain conditions.




Theorem 5.1 below is the central result of this section. Write $D_j = \partial/\partial\theta^j$,
$j = 1, \ldots, p$. Let $\rho = \rho(\theta) = \log \pi(\theta)$ and write $\rho_r = D_r \rho$. We use the summation
convention throughout.


THEOREM 5.1. Assume that one of the conditions R1-R3 holds. Then

$$D(\theta, \pi) = A(\theta, \pi) + M(\theta), \tag{5.1}$$

where

$$A(\theta, \pi) = i^{rs} \rho_r \rho_s + 2 D_s(i^{rs} \rho_r) \tag{5.2}$$

and $M(\theta)$ is independent of $\pi$.


We will prove Theorem 5.1 via four lemmas, each of which evaluates the lead- 
ing term in one of the terms on the right-hand side of equation (4.3). We discuss 
suitable sets of regularity conditions following the proof. 


; _ ð ð 9 E 
For 1 < jkr,.. <= p, define D ik; .. = 567 80F 9607 ^» ikre = 


{D jkr LO) }g 3; Cj; = —Ayr,C = (Cyr), po (6^ Pyk = Dyke, Pyk = 
pk .(8) and 
kkl- rst e5 kki rst: (0) = E* (D ju. log f (Xu 0) Drst .. log F(X; 0)). 
Also define 
ky — i^ (pr t P; Pr), k3 = 3k rui, P", 
kt = 3kijrpsi" i", ki = 15k jrskuvwi t" i “i?” 





and 
Q1 = Dysi™, Q5 =k, Q3 = 3D; (ki i i^), O4 =k: 


LEMMA 5.1. 


1 1 1 l 
E? (b > z(& — kt + ik" &). 
ne LAR) 25 ti 2 se 36 


PROOF. Comparing with the Bayesian Bartlett correction factor as given in 
equation (2.6) of [13], we obtain 


(5.3) T T (m + pint gibt ue Hs) +0(n-), 
where 

Hy =!" (pyr cB) H2 = Bayrsycl”™, 

H3 = 3ayrpsc%c", Ha = I5d;süuonc^ € 6c". 


Noting that E? (H3) = k? +0(1),a=1,...,4, the lemma follows from (5.3). O 




LEMMA 5.2. 
nbe(@) > z- (0: 0 503 + 3204) 
j m ee qe igo e n 


PROOF. Comparing with the frequentist Bartlett correction factor as given in 
equation (2.10) of [13], we obtain 


1 1 1 1 
br(0) = — E ies x -l 
F(@) Fon ( O1+ 5502 503 +3204) +o(n ), 
from which the result follows. O 


LEMMA 5.3. 


z (Ê) M m 
Perl] | tt gin) 


where b' = ij'iV ky, + Lil" if ky. 
PROOF. From [20], page 209, we see that 
(5.4) E? (0") 20" -n^!&' Hon’), 
(5.5) Cov? (0" , 65) = nli" + o(n^). 
By applying Bartlett's identity, 
kykt d kyke + Regt + ht gk +p kt 0 


(cf. equation (7.2) of [20]), it can be seen that our expression for b” agrees with 
that of McCullagh. From (5.4), (5.5) and the Taylor expansion of p(@) around 0, 
we obtain 


E° (p(0)) = p(@) - n^ ' pr + 5^ prsi” +00"), 
from which the lemma follows. D 


LEMMA 5.4. 


"Pres; 


1 
> —i^ (ss st P bi T jura) 


et 51^ P" (Ris —lj ily) + kjisi kivi + kot “Kja + kirkii A 




PROOF. By the Taylor expansion of a;, =}; , (8) around 0, we get 
(5.6) ajr = ky (0) +e yr +007’), 
where 
ejr = ljr — kyr  Kjes (0* — 0°) 
rs — kjY(0* — 05) + 1k, (0* — 05) (0! — 0^). 
From (5.6) and (5.7), we obtain 
C —i(0) — E, +o(n"}), 


where E, = (e;r). Noting that J = nC, I(0) = ni(0), i(8) positive definite and 
E, is a matrix with elements of order O (n^ '/^), from the above expression for C 
and standard results on the eigenvalues and determinant of a matrix, it follows by 
the Taylor expansion that 


(5.7) 


URN ETC NE T 
(5.8) les| 50 | = tríi  (O) Es} zut (O)E,i (0)E,)--o(n ^'^). 


Using an expansion for 05 — 07 as in [20], Chapter 7, we obtain 
(5.9) 08 —8* =i (1j E i" 1, (jg — kj) + ke i" luly} ton”). 
Substituting (5.9) into (5.7) and using (5.4) and (5.5), it follows that 
(5.10) E? (esr) =n} (kj P5 keys i? + Ak esi?!) + o(n 0) 
and 
uis E^ (e peu) =n (jr ku — i jriku) 
+ riu, w + Kkuwk jr, + Kjreikuw)i! "] + o(n7 1). 


While all four terms on the right-hand side of (5.7) are required in evaluat- 
ing (5.10), only the first two terms on the right-hand side of (5.7) are required in 
evaluating (5.11). The lemma follows on taking expectations on both sides of (5.8) 
and using (5.10) and (5.11) on the right-hand side. (1 


PROOF OF THEOREM 5.1. First, putting Lemmas 5.1 and 5.2 together gives 
np(E* (bg) + be(0)) > 5((Q1--kD — 3(Q3 — k3) + t(Q2 + 104)]. 
Along with Lemmas 5.3 and 5.4, this gives equation (5.1) with 
A(0, x) = i" (py ps + 2prs) + 2(kjku + kjk uiti?” pr. 
Now note that D,ixj = —D, E (gj) = —(Kkyr + kkj,r) so that 
A(O, 1) — i (pr ps + 2prs) — 2D, jg)i i?" pr. 


Finally, $D_s(i_{jk})\,i^{jr} i^{ks} = -D_s(i^{jr})\,i_{jk}\,i^{ks} = -D_s(i^{rs})$ and so

$$A(\theta, \pi) = i^{rs}(\rho_r \rho_s + 2\rho_{rs}) + 2 D_s(i^{rs})\rho_r = i^{rs}\rho_r \rho_s + 2 D_s(i^{rs}\rho_r),$$

as required. □


We briefly discuss suitable regularity conditions on the likelihood and prior
for the validity of the three forms of remainder R1-R3, although we will not
dwell on alternative sets of sufficient conditions in the present paper. There are
broadly two sets of conditions required, those for the validity of the Laplace
approximation of $p^{\pi}(x)$ and those for the validity of the approximation of each
of the terms in (4.3). Consider first the form of remainder R2, ignoring for the
moment the uniformity requirement. A suitable set of conditions for this form
of remainder is given in Section 3 of [15], which constitutes the definition of a
"Laplace-regular" family. Broadly, one requires $l(\theta)$ to be six-times continuously
differentiable and $\pi(\theta)$ to be four-times continuously differentiable, plus additional
conditions controlling the error term and nonlocal behavior of the integrand.
Since additionally we require uniformity in compact subsets of $\Theta$ in R2, we need
to replace the neighborhood $B_{\varepsilon}(\theta_0)$ in these conditions by an arbitrary compact
subset of $\Theta$. In addition to these conditions, for the approximation of the terms
in (4.3) we require the expectations of the mixed fourth-order partial derivatives
of $\log f(X; \theta)$ to be continuous and also conditions guaranteeing the expansions
for the expectation of $\hat\theta$ needed in the proofs of Lemmas 5.3 and 5.4, as given in
[20], Chapter 7. From an examination of the relevant proofs, it is seen that a slight
strengthening of the above conditions will be required for the stronger form R3
of remainder. For example, $l(\theta)$ and $\pi(\theta)$ seven-times and five-times continuously
differentiable, respectively, will give rise to a higher-order version of Laplace-regularity.
Finally, the weaker form of remainder R1 would apply when $l(\theta)$ and
$\pi(\theta)$ are only four-times and twice continuously differentiable, respectively, again
with additional regularity conditions controlling, for example, the nonlocal behavior
of the integrand in the Laplace approximation and giving uniformity of all the
$o(n^{-1})$ remainder terms.

Returning to the predictive loss function, it follows from Theorem 5.1 that, for
$\pi \in C$, the asymptotic predictive loss function (4.5) is given by

$$L(\theta, \pi) = A(\theta, \pi) - A(\theta, \pi^J), \tag{5.12}$$

where $A(\theta, \pi^J) = i^{rs}\nu_r \nu_s + 2 D_s(i^{rs}\nu_r)$ and $\nu = \log \pi^J = \frac{1}{2}\log|i|$. It is interesting
to note that (5.12) is of the same form as the right-hand side of the first expression
in Theorem 4 of [14], which relates to the comparison of estimative predictive
distributions based on Bayes estimators. In the case of a single prediction ($m = 1$),
the connection can be understood from Theorem 7 of [14], which establishes that,
to the asymptotic order considered here, the Kullback-Leibler difference between
the posterior and the associated estimative predictive distributions is independent



of the prior. The derivation of Theorem 5.1 given here is more direct, as it does not 
involve Bayes estimators. Moreover, our result applies for an arbitrary amount of 
prediction. 

Note that $L(\theta, \pi)$ only depends on the sampling model through Fisher's
information. The quantity $M(\theta)$, however, involves components of skewness and
curvature of the model. We do not consider $M(\theta)$ further in this paper, although
its form, which may be deduced from the results of Lemmas 5.1-5.4, may be
of independent interest. It may be verified directly that $L(\theta, \pi)$ is invariant under
parameter transformation, as expected in view of (4.6) and the invariance of
$L_{Y|X}(\theta, \pi)$. Furthermore, since all the terms in (4.2) are invariant, it follows that
$\bar{M}(\theta) = M(\theta) + A(\theta, \pi^J)$ must also be an invariant quantity. In the case $p = 1$,
we obtain the relatively simple expression


(5.13) M(0) = oti t iy^, 
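where $\alpha_{111}$ is the skewness and $\gamma^2 = \alpha_{22} - \alpha_{12}^2 - 1$ is Efron's curvature, with

$$\alpha_{jk\cdots}(\theta) = \{i(\theta)\}^{-(j + k + \cdots)/2}\,E^{\theta}\{l^{j}(\theta)\,l^{k}(\theta) \cdots\},$$

where $l^{j}$ is the $j$th derivative of $l$.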


EXAMPLE 5.1. Normal model with unknown mean. As a simple first example,
suppose that $X_i \sim N(\theta, 1)$. Here $i(\theta) = 1$ and $\alpha_{111}(\theta) = \gamma^2(\theta) = 0$ so that
$L(\theta, \pi) = (\rho')^2 + 2\rho''$ and $M(\theta) = 0$ from (5.13). By construction, $L(\theta, \pi^J) = 0$,
but note that the improper priors $\pi^c \propto \exp\{c(\theta - \theta_0)\}$, $c \in \mathbb{R}$, also deliver constant
loss, with $L(\theta, \pi^c) = c^2 \ge 0$. We will see in Section 6 that Jeffreys' prior is minimax
in this example. Since here $M(\theta) = 0$ and $\pi^J(\theta) \propto 1$, this result also follows
from the exact analysis of the criterion (2.1) in [19].
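For a scalar parameter, (5.2) and (5.12) reduce to A(θ, π) = ρ'(θ)²/i(θ) + 2 d/dθ{ρ'(θ)/i(θ)} and L(θ, π) = A(θ, π) − A(θ, π^J). The sketch below evaluates these by central finite differences and applies them to the normal model of Example 5.1. It is a numerical illustration only: the function names, the step size and the use of finite differences are assumptions made here, not part of the paper.

```python
import math

def derivative(f, x, h=1e-3):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def A(log_prior, fisher_info, theta, h=1e-3):
    """A(theta, pi) = rho'^2 / i + 2 d/dtheta (rho' / i) for p = 1."""
    def flux(t):
        return derivative(log_prior, t, h) / fisher_info(t)
    rho1 = derivative(log_prior, theta, h)
    return rho1 ** 2 / fisher_info(theta) + 2.0 * derivative(flux, theta, h)

def predictive_loss(log_prior, fisher_info, theta, h=1e-3):
    """L(theta, pi) = A(theta, pi) - A(theta, pi^J), with log pi^J = (1/2) log i."""
    def log_jeffreys(t):
        return 0.5 * math.log(fisher_info(t))
    return (A(log_prior, fisher_info, theta, h)
            - A(log_jeffreys, fisher_info, theta, h))

# Normal model with unit variance: i(theta) = 1 and prior proportional to exp(c * theta).
c = 0.7
loss = predictive_loss(lambda t: c * t, lambda t: 1.0, theta=1.3)
print(abs(loss - c ** 2) < 1e-6)
```

In this model the computed loss should agree with c² at every θ, so the exponential priors are equalizers, as noted in the example.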


Now let €2 be the class of priors having compact support in © and let T = 
QN C. It follows from (4.6) that if w € C and t € Q, then L(z, 7) is equal 
to the expected predictive loss $\int L(\theta, \pi)\tau(\theta)\,d\theta$. Since $\tau \in C$, we also have
$\zeta(\tau) = -\int L(\theta, \tau)\tau(\theta)\,d\theta$, which is finite since $L(\theta, \tau)$ is continuous and, hence,
bounded on compact subsets of $\Theta$. The next result gives expressions for the predictive
regret $d(\tau, \pi)$ and predictive information $\zeta(\tau)$ when $\pi \in C$ and $\tau \in \Gamma$. The
expression for $\zeta(\tau)$ here is similar to that given in Theorem 5 of [14] for the Bayes
risk of bias-adjusted estimators.


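LEMMA 5.5. Suppose $\pi \in C$ and $\tau \in \Gamma$. Then

$$d(\tau, \pi) = \int i^{rs}(\rho_r - \mu_r)(\rho_s - \mu_s)\,\tau\,d\theta \tag{5.14}$$

and

$$\zeta(\tau) = \int i^{rs}(\mu_r - \nu_r)(\mu_s - \nu_s)\,\tau\,d\theta, \tag{5.15}$$

where $\mu = \log \tau$.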


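PROOF. From (5.2), integration by parts gives

$$\int A(\theta, \pi)\tau(\theta)\,d\theta = \int i^{rs}\rho_r \rho_s\,\tau\,d\theta - 2\int i^{rs}\rho_r \mu_s\,\tau\,d\theta + 2B(\tau, \pi), \tag{5.16}$$

where

$$B(\tau, \pi) = \sum_{s} \int \Big[\, i^{rs}\rho_r\,\tau \,\Big]_{\theta^s = \underline{\theta}^s(\theta^{(-s)})}^{\theta^s = \bar{\theta}^s(\theta^{(-s)})}\,d\theta^{(-s)}$$

and $\underline{\theta}^s(\theta^{(-s)})$ and $\bar{\theta}^s(\theta^{(-s)})$ are the finite lower and upper limits of integration
for $\theta^s$ for fixed $\theta^{(-s)}$, the vector of components of $\theta$ omitting $\theta^s$. But $B(\tau, \pi) = 0$,
since both $\pi$ and $\tau$ are in $C$. Therefore,

$$\int A(\theta, \pi)\tau(\theta)\,d\theta = \int i^{rs}\rho_r(\rho_s - 2\mu_s)\,\tau\,d\theta. \tag{5.17}$$

Evaluating (5.17) at $\pi = \tau \in C$ gives

$$\int A(\theta, \tau)\tau(\theta)\,d\theta = -\int i^{rs}\mu_r \mu_s\,\tau\,d\theta. \tag{5.18}$$

It now follows from (5.17) and (5.18) that

$$d(\tau, \pi) = L(\tau, \pi) - L(\tau, \tau) = \int \{A(\theta, \pi) - A(\theta, \tau)\}\tau(\theta)\,d\theta = \int i^{rs}\{\rho_r(\rho_s - 2\mu_s) + \mu_r \mu_s\}\,\tau\,d\theta,$$

which gives (5.14). Since $\zeta(\tau) = d(\tau, \pi^J)$, (5.15) follows on evaluating the above
expression at $\pi = \pi^J$.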


The expression (5.15) for the predictive information $\zeta(\tau)$ is seen to be invariant
under reparameterization, as expected. It might appear at first sight that $\zeta(\tau)$ will
attain the value zero at $\tau = \pi^J$, but this is not necessarily the case since $\pi^J$ may
be improper and there may be no sequence of priors in $\Gamma$ converging to $\pi^J$ in the
right way: see the next section. Finally, note that the form of $d(\tau, \pi)$ in Lemma 5.5
implies that $L(\theta, \pi)$ is a $\Gamma$-strictly proper scoring rule since $d(\tau, \pi)$ attains its
minimum value of zero uniquely at $\pi = \tau \in \Gamma$.


6. Impartial, minimax and maximin priors. As expected, for a given prior
density $\pi \in \Pi_\infty$, from (4.10) the posterior predictive regret will be large when the
predictive information (4.9) in $\tau$ is large. Therefore it is not possible to achieve
constant regret over all possible $\tau \in \Phi$, nor minimaxity since the regret is unbounded.
Instead, as discussed in Section 2, we consider the predictive regret associated
with using $\pi$ compared to using Jeffreys' prior and study the behavior of
the predictive loss function

$$L(\tau, \pi) = d(\tau, \pi) - d(\tau, \pi^J), \tag{6.1}$$




which is the asymptotic form of the normalized version of equation (2.4). 

Adopting standard game-theoretic terminology, the prior $\pi \in \Pi_\infty$ is an equalizer
prior if the predictive loss $L(\theta, \pi)$ is constant over $\theta \in \Theta$. This is equivalent
to the predictive loss (6.1) being constant over all $\tau \in \Gamma$. We will therefore refer
to an equalizer prior as an impartial prior. The prior $\pi_0 \in \Pi_\infty$ is minimax if
$\sup_{\tau \in \Phi} L(\tau, \pi_0) = \bar{W}$, where

$$\bar{W} = \inf_{\pi \in \Pi_\infty} \sup_{\tau \in \Phi} L(\tau, \pi)$$

is the upper value of the game. To obtain minimax solutions, we will adopt a standard
game theory technique of searching for equalizer rules and showing that they
are "extended Bayes" rules; see, for example, Chapter 5 of [4]. This is also the
strategy used by Liang and Barron [19] for deriving minimax priors under the
predictive regret (2.2) for location and scale families. In the present context the
relevant result is given as Theorem 6.1 below.

Let $\Phi^{+} \subset \Pi_\infty$ be the class of priors $\pi$ in $\Pi_\infty$ for which there exists a
sequence $(\tau_k)$ of priors in $\Phi$ satisfying (i) $L(\tau_k, \pi) = \int L(\theta, \pi)\,d\tau_k(\theta)$ and
(ii) $d(\tau_k, \pi) \to 0$. Since $L(\tau, \pi)$ is a proper scoring rule, each $\tau_k$ is a Bayes solution
and, hence, $\Phi^{+}$ can be regarded as a class of extended Bayes solutions.
If $\pi \in \Phi^{+}$ is an equalizer prior, then we can unambiguously define its predictive
information as

$$\zeta(\pi) = \lim_{k \to \infty} \zeta(\tau_k)$$

for any sequence $\tau_k \in \Phi$ satisfying (i) and (ii) above. This is true since $L(\theta, \pi) = c$,
say, for all $\theta \in \Theta$, and so for every such sequence we have $L(\tau_k, \pi) = c$ for all $k$
from (i). Therefore, from (4.10),

$$\zeta(\tau_k) = d(\tau_k, \pi) - c, \tag{6.2}$$

which tends to $-c$ as $k \to \infty$.

Finally, we define the class $U \subset \Pi_\infty$ of priors $\pi$ for which

$$\limsup_{n \to \infty}\, c_n \sup_{\theta \in \Theta} L_{Y|X}(\theta, \pi) < \infty \tag{6.3}$$

for every sequence $(m_n)$. Clearly, priors in $U^{c}$ have poor finite sample predictive
behavior relative to Jeffreys' prior.


LEMMA 6.1. Suppose that $\pi \in C \cap U$, that R1, R2 or R3 holds and that $(m_n)$
is any sequence satisfying the conditions in Theorem 4.1(a), (b) or (c), respectively.
Then

$$\sup_{\tau \in \Phi} L(\tau, \pi) \le \sup_{\theta \in \Theta} L(\theta, \pi).$$


PROOF. Let $\tau \in \Phi$, $\varepsilon > 0$ and choose a compact set $K \subset \Theta$ for which
$\int_{K^c} d\tau(\theta) < \varepsilon$. Then

$$L_{Y|X}(\tau, \pi) \le \sup_{\theta \in K} L_{Y|X}(\theta, \pi) + \varepsilon \sup_{\theta \in K^c} L_{Y|X}(\theta, \pi)$$

so that

$$L(\tau, \pi) = \limsup_{n \to \infty} c_n L_{Y|X}(\tau, \pi) \le \sup_{\theta \in K} L(\theta, \pi) + k\varepsilon$$

from (4.6) since $\pi \in C$, where $k = \limsup_{n \to \infty} c_n \sup_{\theta} L_{Y|X}(\theta, \pi) < \infty$ since
$\pi \in U$. The result follows since $\varepsilon$ was arbitrary. □


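We now establish the following connection between equalizer and minimax priors.


THEOREM 6.1. Suppose that $\pi_0 \in \Phi^{+} \cap C \cap U$ is an equalizer prior, that
R1, R2 or R3 holds with $\pi = \pi_0$ and that $(m_n)$ is any sequence satisfying the
conditions in Theorem 4.1(a), (b) or (c) respectively. Then $\pi_0$ is minimax and
$\zeta(\pi_0) = \inf_{\tau \in \Phi} \zeta(\tau)$.


PROOF. Define

$$\underline{W} = \sup_{\tau \in \Phi} \inf_{\pi \in \Pi_\infty} L(\tau, \pi)$$

to be the lower value of the game. Then $\underline{W} \le \bar{W}$ is a standard result from game theory.
Next, since $\pi_0$ is an equalizer prior, we have $L(\theta, \pi_0) = c$, say, for all $\theta \in \Theta$.
Therefore, $\bar{W} = \inf_{\pi \in \Pi_\infty} \sup_{\tau \in \Phi} L(\tau, \pi) \le \sup_{\tau \in \Phi} L(\tau, \pi_0) \le \sup_{\theta \in \Theta} L(\theta,
\pi_0) = c$ from Lemma 6.1 since $\pi_0 \in C \cap U$. Therefore, $\bar{W} \le c$.

Since from Lemma 4.1 $L(\tau, \pi)$ is a $\Phi$-proper scoring rule, we have
$\inf_{\pi \in \Pi_\infty} L(\tau, \pi) \ge L(\tau, \tau) = -\zeta(\tau)$ for every $\tau \in \Phi$. Therefore, $\underline{W} \ge
-\inf_{\tau \in \Phi} \zeta(\tau)$. Since $\pi_0 \in \Phi^{+}$, there exists a sequence $(\tau_k)$ in $\Phi$ with $d(\tau_k,
\pi_0) \to 0$. Therefore, since $\zeta(\tau_k) \ge \inf_{\tau \in \Phi} \zeta(\tau) \ge -\underline{W}$ and, from (6.2), $\zeta(\tau_k) \to -c$
as $k \to \infty$, we have $c \le \underline{W}$. These relations give $\bar{W} \le c \le \underline{W}$ and it follows that
$\bar{W} = c = \underline{W}$. The result now follows from the definitions of minimaxity and $\zeta(\pi_0)$.
□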


We see that, under the conditions of Theorem 6.1, the minimax prior $\pi_0$ has
a natural interpretation of containing minimum predictive information about $Y$,
since the infimum of the predictive information (4.9) is attained at $\tau = \pi_0$. Equivalently,
$\pi_0$ is maximin since it maximizes the Bayes risk $-\zeta(\tau)$ of $\tau \in \Phi$ under (4.8)
and, hence, is a least favorable prior under predictive loss. Notice also that
Theorem 6.1 implies that $\sup_{\tau \in \Phi} L(\tau, \pi_0) = c$, regardless of the particular sequence
$(m_n)$ used.




We note that for the assertion of Theorem 6.1 to hold we require that $\pi_0$ satisfies
condition (6.3). There may exist a prior $\pi_1 \in U^{c}$ which appears to dominate
the minimax prior $\pi_0$ on the basis of the asymptotic predictive loss function
$L(\theta, \pi)$. However, this prior will possess poor penultimate asymptotic behavior
since $L_{Y|X}(\theta, \pi_1)$ will be asymptotically unbounded. This will be reflected in the
value of $\sup_{\tau \in \Phi} L(\tau, \pi_1)$, which will necessarily be greater than $\sup_{\theta \in \Theta} L(\theta, \pi_1)$.
This phenomenon will be illustrated in Example 6.1.


COROLLARY 6.1. Assume the conditions of Theorem 6.1 and additionally
that $\pi_0$ is proper. Then if $\zeta(\pi_0) = -c$, where $c$ is the constant value of $L(\theta, \pi_0)$,
then $\pi_0$ is minimax and $\zeta(\pi_0) = \inf_{\tau \in \Phi} \zeta(\tau)$.


PROOF. Since $d(\pi_0, \pi_0) = 0$ and $\int L(\theta, \pi_0)\,d\pi_0(\theta) = c = -\zeta(\pi_0) =
L(\pi_0, \pi_0)$, it follows on taking $\tau_k = \pi_0$ that $\pi_0 \in \Phi^{+}$. The result now follows
from Theorem 6.1. □


Suppose that $\pi_0 \in C \cap U$ is an improper equalizer prior. One way to show that
$\pi_0 \in \Phi^{+}$ is to construct a sequence $(\tau_k)$ of priors in $\Gamma$ for which $d(\tau_k, \pi_0) \to 0$,
where $d(\tau, \pi_0)$ is given by formula (5.14). As noted just prior to Lemma 5.5, the
condition $L(\tau_k, \pi_0) = \int L(\theta, \pi_0)\,d\tau_k(\theta)$ is automatically satisfied when $\tau_k \in \Gamma$.

We consider first the case $p = 1$. In this case it turns out that Jeffreys' prior
is a minimax solution, and, hence, the assertion at the end of Example 5.1. Let
$\mathcal{H}$ be the class of probability density functions $h$ on $(-1, 1)$ possessing second-order
continuous derivatives and that satisfy $h(-1) = h'(-1) = h''(-1) = h(1) =
h'(1) = h''(1) = 0$ and

$$\int_{-1}^{1} \{g'(u)\}^2 h(u)\,du < \infty, \tag{6.4}$$

where $g(u) = \log h(u)$; that is, the Fisher information associated with $h$ is finite.
The class $\mathcal{H}$ is nonempty, since the density of the random variable $U = 2V - 1$,
where $V$ has any beta$(a, b)$ distribution with $a, b > 3$, satisfies these conditions.


COROLLARY 6.2. Suppose that $p = 1$. Then Jeffreys' prior is minimax and
$\zeta(\pi^J) = \inf_{\tau \in \Phi} \zeta(\tau)$.


PROOF. Since $L(\theta, \pi^J) = 0$, Jeffreys' prior is an equalizer prior. We therefore
need to show that $\pi^J \in \Phi^+ \cap C \cap U$. Recall that $\pi^J \in C$ was an assumption
made in Section 4. Also, since $L_{Y|X}(\theta, \pi^J) = 0$ for all $n$ from (2.5), $\pi^J \in U$.

If $\pi^J$ is proper, the result now follows immediately from Corollary 6.1 since
$\zeta_{Y|X}(\pi^J) = 0$ for all $n$. Suppose then that $\pi^J$ is improper. Without loss of
generality, we assume that $i(\theta) = 1$, so that Jeffreys' prior is uniform. Since $\pi^J$ is
improper, without loss of generality we take $\Theta$ to be either $(-\infty, \infty)$ or $(0, \infty)$
by a suitable linear transformation. Now let $U$ be a random variable with density
$h \in \mathcal{H}$.

Suppose first that $\Theta = (-\infty, \infty)$ and let $\tau_k$ be the density of $\theta = kU$. Clearly,
$\tau_k \in \Gamma$, $\tau_k$ has support $[-k, k]$ and $\mu_k'(\theta) = g'(u)/k$, where
$\mu_k = \log \tau_k$ and $u = \theta/k$. Therefore, from (5.14),

$$d(\tau_k, \pi^J) = \frac{1}{k^2} E\{g'(U)\}^2 \to 0$$

as $k \to \infty$ from (6.4), so that $\pi^J \in \Phi^+$. The result now follows from Theorem 6.1.

Next suppose that $\Theta = (0, \infty)$ and let $\tau_k$ be the density of $\theta = k(U + 1) + 1$.
Then $\tau_k \in \Gamma$, $\tau_k$ has support $[1, 2k + 1]$ and $\mu_k'(\theta) = g'(u)/k$, where
$u = (\theta - 1)/k - 1$. Therefore, from (5.14),

$$d(\tau_k, \pi^J) = \frac{1}{k^2} E\{g'(U)\}^2 \to 0$$

as $k \to \infty$ from (6.4), so that $\pi^J \in \Phi^+$ and again the result follows from
Theorem 6.1. □


EXAMPLE 6.1. Bernoulli model. Here Jeffreys' prior is the beta$(1/2, 1/2)$
distribution, which is therefore minimax from Corollary 6.2. The underlying
Bernoulli probability mass function is $f(x|\theta) = \theta^x (1 - \theta)^{1-x}$, $x = 0, 1$,
$0 < \theta < 1$. Let $\pi^a$ be the density of the beta$(a, a)$ distribution, where $a > 0$. It is
straightforward to check from (5.12) that

$$L(\theta, \pi^a) = \left(a - \tfrac{1}{2}\right)\left\{-4\left(a - \tfrac{1}{2}\right) + \frac{a - \tfrac{3}{2}}{\theta(1 - \theta)}\right\},$$

from which we see that $L(\theta, \pi_1) = -4$, where $\pi_1 = \pi^{3/2}$, the beta$(3/2, 3/2)$
distribution. Hence, the prior $\pi_1$ would appear to dominate Jeffreys' prior. In view of
Corollary 6.2, however, we conclude that condition (6.3) must break down for this
prior. Indeed, it can be shown directly that $c_n L_{Y|X}(\theta, \pi_1)$ is an increasing function
of $m$ for fixed $n$ and that, when $m = 1$, we have $c_n L_{Y|X}(0, \pi_1) = n + O(1)$. By
the continuity of $L_{Y|X}(\theta, \pi_1)$ in $(0, 1)$, it follows that
$c_n \sup_\theta L_{Y|X}(\theta, \pi_1) \to \infty$ as $n \to \infty$ for every sequence $(m_n)$ and so
$\pi_1 \notin U$. Therefore, $\pi_1$ exhibits poor finite sample predictive behavior relative to
Jeffreys' prior for values of $\theta$ close to 0 or 1.
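
The shape of this asymptotic loss is easy to tabulate. The sketch below evaluates the loss of the beta$(a, a)$ prior in the Bernoulli model, taken here to be $L(\theta, \pi^a) = (a - 1/2)\{-4(a - 1/2) + (a - 3/2)/(\theta(1 - \theta))\}$, on a small grid of $\theta$ values. The function name and the grid are hypothetical choices made for illustration.

```python
def asymptotic_loss(theta, a):
    """Asymptotic predictive loss of the beta(a, a) prior at theta in (0, 1)."""
    return (a - 0.5) * (-4.0 * (a - 0.5) + (a - 1.5) / (theta * (1.0 - theta)))


thetas = [0.01, 0.1, 0.5, 0.9, 0.99]
for a in (0.5, 1.0, 1.5):
    # a = 0.5 is Jeffreys' prior; a = 1.5 is the prior that appears to dominate it.
    print(a, [round(asymptotic_loss(t, a), 3) for t in thetas])
```

In this family only $a = 1/2$ and $a = 3/2$ give a loss that is constant in $\theta$. For every other $a$ the term in $1/(\theta(1 - \theta))$ makes the loss unbounded near the boundary, which is why the comparison has to be supplemented by the boundedness condition (6.3).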

It is of some interest to compare this behavior with the asymptotic minimax 
analysis under the prior predictive regret (4.1). Under (4.1), Jeffreys’ prior is as- 
ymptotically maximin [8], but not minimax due to its poor boundary risk behavior. 
However, a sequence of priors converging to Jeffreys' prior can be constructed 
that is asymptotically minimax [26]. Under our posterior predictive regret crite- 
rion, Jeffreys' prior is both maximin and minimax. In particular, it follows that it 
is not possible to modify the beta$(3/2, 3/2)$ distribution at the boundaries to make it
asymptotically minimax. 




In the examples below our strategy for identifying a minimax prior will be
to consider a suitable class of candidate priors in $C$, compute the predictive
loss (5.12), identify the subclass of equalizer priors in $U$ and choose the prior $\pi_0$
in this subclass, assuming it is nonempty, with minimum constant loss. Clearly,
$\pi_0$ will be minimax over this subclass of equalizer priors. If, in addition, it can be
shown that $\pi_0 \in \Phi^+$, then the conditions of Theorem 6.1 hold and $\pi_0$ is minimax
over $\Phi$. In particular, we will see that in dimensions greater than one, although
Jeffreys' prior is necessarily impartial, it may not be minimax. This is not surprising,
since we know that in the special case of transformation models the right Haar
measure is the best invariant prior under posterior predictive loss (see Section 2).
Exact minimax solutions for Examples 6.2 and 6.3 under the predictive regret (2.2)
have recently been obtained by Liang and Barron [19]. Finally, all these examples
are sufficiently regular for the strongest form R3 of remainder to hold for the
priors $\pi$ that are obtained. Hence, from Theorem 4.1(c), all the results will apply for
an arbitrary amount of prediction.


EXAMPLE 6.2. Normal model with unknown mean and variance. Here
$X \sim N(\beta, \sigma^2)$ and $\theta = (\beta, \sigma)$. We will show that the prior
$\pi_0(\theta) \propto \sigma^{-1}$ is minimax. This is Jeffreys' independence prior, or the right
Haar measure under the group of affine transformations of the data.

Consider the class of improper priors $\pi^a(\theta) \propto \sigma^{-a}$ on $\Theta$, where $a \in \mathbb{R}$.
Transforming to $\phi = (\beta, \lambda)$, where $\lambda = \log \sigma$, these priors become
$\pi^a(\phi) \propto \exp\{-(a - 1)\lambda\}$ in the $\phi$-parameterization. Here we find that
$i(\phi) = \operatorname{diag}(e^{-2\lambda}, 2)$. Since $\rho^a(\phi) = \log \pi^a(\phi) = -(a - 1)\lambda$, it follows
immediately from (5.2) that $A(\phi, \pi^a) = \frac{1}{2}(a - 1)^2$. Furthermore, since
$|i(\phi)| = 2e^{-2\lambda}$, we have $\pi^J(\phi) \propto e^{-\lambda} = \pi^2(\phi)$ so that
$A(\phi, \pi^J) = \frac{1}{2}$. It now follows from (5.12) that
$L(\phi, \pi^a) = \frac{1}{2}\{(a - 1)^2 - 1\}$. Therefore, all priors in this class are equalizer
priors and $L(\phi, \pi^a)$ attains its minimum value in this class when $a = 1$, which
corresponds to $\pi_0(\phi) \propto 1$, or $\pi_0(\theta) \propto \sigma^{-1}$ in the $\theta$-parameterization.
Note that the minimum value $-\frac{1}{2}$ is less than 0, which is the loss under Jeffreys' prior.

We now show that $\pi_0 \in \Phi^+ \cap C \cap U$. Clearly, $\pi_0 \in C$, while $\pi_0 \in U$ follows
because $L_{Y|X}(\theta, \pi_0)$ is constant for all $n$ since $\pi_0$ is invariant under the transitive
group of transformations of $\Theta$ induced by the group of affine transformations of
the observations (see Section 2). It remains to show that $\pi_0 \in \Phi^+$. Let $U_1, U_2$ be
independent random variables with common density $h \in \mathcal{H}$ and let $\tau_k$ be the joint
density of $\phi = (\beta, \lambda)$, where $\beta = k_1 U_1$, $\lambda = k_2 U_2$ and $k_1, k_2$ are functions of $k$
to be determined. Let $\mu_k = \log \tau_k$. Then $\mu_{kr} = k_r^{-1} g'(U_r)$, $r = 1, 2$, where
$g = \log h$. Write $\alpha = \int_{-1}^{1} \{g'(u)\}^2 h(u)\, du < \infty$ since $h \in \mathcal{H}$. Since
$\rho_0(\phi) = \log \pi_0(\phi)$ is constant, it follows from (5.14) that

$$d(\tau_k, \pi_0) = E\left[k_1^{-2} e^{2\lambda} \{g'(U_1)\}^2 + \tfrac{1}{2} k_2^{-2} \{g'(U_2)\}^2\right]
\le \alpha \left\{k_1^{-2} e^{2k_2} + \tfrac{1}{2} k_2^{-2}\right\},$$

since $\lambda \le k_2$. Now take $k_1 = k e^k$, $k_2 = k$. Then
$d(\tau_k, \pi_0) \le 3\alpha/(2k^2) \to 0$ as $k \to \infty$ and, hence, $\pi_0 \in \Phi^+$. It now
follows from Theorem 6.1 that $\pi_0$ is minimax and that $\zeta(\pi_0) = \frac{1}{2}$.
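
The role of the scaling $k_1 = k e^k$, $k_2 = k$ can be seen by evaluating the upper bound $\alpha\{k_1^{-2} e^{2k_2} + \frac{1}{2} k_2^{-2}\}$ directly. The following sketch is only an arithmetic illustration: the value of $\alpha$ is an arbitrary assumption, and the "naive" choice $k_1 = k_2 = k$ is included purely for contrast.

```python
import math


def bound(alpha, k1, k2):
    """Upper bound alpha * (exp(2*k2) / k1**2 + 0.5 / k2**2) on d(tau_k, pi_0)."""
    return alpha * (math.exp(2.0 * k2) / k1 ** 2 + 0.5 / k2 ** 2)


alpha = 10.5  # illustrative value for the Fisher information of h (an assumption)
for k in (1, 5, 10, 50):
    tuned = bound(alpha, k * math.exp(k), k)  # k1 = k * e^k, k2 = k
    naive = bound(alpha, k, k)                # k1 = k2 = k, for contrast only
    print(k, tuned, 1.5 * alpha / k ** 2, naive)
```

With the tuned scaling the exponential factor cancels and the bound decays like $3\alpha/(2k^2)$, whereas with equal scalings the bound grows without limit.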


EXAMPLE 6.3. Normal linear regression. Here $X_i \sim N(z_i^T \beta, \sigma^2)$,
$i = 1, \ldots, n$, where $Z_n = (z_1, \ldots, z_n)^T$ is an $n \times q$ matrix of rank $q \ge 1$ and
$\theta = (\beta, \sigma)$. Using a similar argument to that in Example 6.2, we can show that
again Jeffreys' independence prior, or the right Haar measure, $\pi_0(\theta) \propto \sigma^{-1}$ is
minimax.

Since the variables are not identically distributed in this example, it is not
covered by the asymptotic theory of Sections 4 and 5. However, under suitable
stability assumptions on the sequence $(z_i)$ of regressor variables, at least that
$V_n = n^{-1} Z_n^T Z_n$ is uniformly bounded away from zero and infinity, a version
of Theorem 5.1 will apply.

Proceeding as in Example 6.2, we again consider the class of priors
$\pi^a(\theta) \propto \sigma^{-a}$ on $\Theta$, where $a \in \mathbb{R}$. Transforming to $\phi = (\beta, \lambda)$, where
$\lambda = \log \sigma$, these priors become $\pi^a(\phi) \propto \exp\{-(a - 1)\lambda\}$. Here we find that
$i_n(\phi) = \operatorname{diag}(e^{-2\lambda} V_n, 2)$ and, exactly as in Example 6.2, we obtain
$A(\phi, \pi^a) = \frac{1}{2}(a - 1)^2$. Here $|i_n(\phi)| = 2|V_n| e^{-2q\lambda}$ so
$\pi^J(\phi) \propto e^{-q\lambda} = \pi^{q+1}(\phi)$ for all $n$, giving $A(\phi, \pi^J) = \frac{1}{2} q^2$ and,
hence, $L(\phi, \pi^a) = \frac{1}{2}\{(a - 1)^2 - q^2\}$. Therefore, all priors in this class are
equalizer priors and $L$ attains its minimum value in this class when $a = 1$, which
corresponds to $\pi_0(\phi) \propto 1$, or $\pi_0(\theta) \propto \sigma^{-1}$ in the $\theta$-parameterization.
Notice that the drop in predictive loss increases as the square of the number $q$ of
regressors in the model. Note also that the ratio $|i_n|^{-1} |i_{n+1}|$ is free from $\theta$, so that
a version of Theorem 4.1 will hold.
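
The constant losses in this class can be tabulated in a few lines. The sketch below evaluates $L(\phi, \pi^a) = \frac{1}{2}\{(a - 1)^2 - q^2\}$ for Jeffreys' prior ($a = q + 1$) and for the right Haar prior ($a = 1$); the case $q = 1$ reproduces the values of Example 6.2. The function name is a hypothetical choice for this illustration.

```python
def equalizer_loss(a, q=1):
    """Constant loss ((a - 1)**2 - q**2) / 2 of the prior sigma**(-a) with q regressors."""
    return 0.5 * ((a - 1.0) ** 2 - q ** 2)


for q in (1, 2, 5):
    jeffreys = equalizer_loss(q + 1, q)  # Jeffreys' prior corresponds to a = q + 1
    haar = equalizer_loss(1, q)          # right Haar prior corresponds to a = 1
    print(q, jeffreys, haar, jeffreys - haar)
```

The last column is the drop in predictive loss obtained by moving from Jeffreys' prior to the right Haar prior, which grows like $q^2/2$.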

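Exactly as in Example 6.2, $\pi_0 \in C \cap U$ and it remains to show that $\pi_0 \in \Phi^+$.
Let $p = q + 1$ and $U_j$, $j = 1, \ldots, p$, be independent random variables with common
density $h \in \mathcal{H}$. With the same definitions as in Example 6.2, let $\beta_r = k_1 U_r$,
$r = 1, \ldots, q$, $\lambda = k_2 U_p$, so that $\mu_{kr} = k_1^{-1} g'(U_r)$, $r = 1, \ldots, q$,
$\mu_{kp} = k_2^{-1} g'(U_p)$. Then it follows from (5.14) that, with the summations over
$r$ and $s$ running from 1 to $q$,

$$\begin{aligned}
d(\tau_k, \pi_0) &= E\Big[e^{2\lambda} \sum V_n^{rs} \mu_{kr} \mu_{ks} + \tfrac{1}{2} \mu_{kp}^2\Big] \\
&= E\Big[k_1^{-2} e^{2\lambda} \sum V_n^{rs} g'(U_r) g'(U_s) + \tfrac{1}{2} k_2^{-2} \{g'(U_p)\}^2\Big] \\
&\le \alpha \Big\{k_1^{-2} e^{2k_2} \operatorname{trace}(V_n^{-1}) + \tfrac{1}{2} k_2^{-2}\Big\},
\end{aligned}$$

using $\int_{-1}^{1} g'(u) h(u)\, du = 0$. Now take $k_1 = k e^k$, $k_2 = k$. Then, as before,
$d(\tau_k, \pi_0) \to 0$ as $k \to \infty$ and, hence, $\pi_0 \in \Phi^+$. It follows from Theorem 6.1
that $\pi_0$ is minimax and $\zeta(\pi_0) = \frac{1}{2} q^2$.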

Interestingly, we note that the priors $\pi_0$ identified in Examples 6.2 and 6.3 also
give rise to minimum predictive coverage probability bias; see [12]. The next ex- 
ample is more challenging and illustrates the difficulties associated with finding 
minimax priors more generally. 




EXAMPLE 6.4. Multivariate normal. Here $X \sim N_q(\mu, \Sigma)$, with $\theta$ comprising
all elements of $\mu$ and $\Sigma$. Write $\Sigma^{-1} = T'T$, where $T = (t_{ij})$ is a lower
triangular matrix satisfying $t_{ii} > 0$. Let $\mu = (\mu_1, \ldots, \mu_q)$, $\psi_i = t_{ii}$,
$1 \le i \le q$, $\psi = (\psi_1, \ldots, \psi_q)$, $\beta_{ij} = t_{ij}/t_{ii}$, $1 \le j < i \le q$, and
$\beta^{(i)} = (\beta_{i1}, \ldots, \beta_{i,i-1})$, $2 \le i \le q$. Then
$\gamma = (\psi, \beta^{(2)}, \ldots, \beta^{(q)}, \mu)$ is a one-to-one transformation of $\theta$. The
log-likelihood is

$$l(\gamma) = \sum_{i=1}^{q} \log \psi_i - \frac{1}{2} \sum_{i=1}^{q} \psi_i^2 \Big\{\sum_{j=1}^{i} \beta_{ij} (x_j - \mu_j)\Big\}^2,$$

writing $\beta_{ii} = 1$, $i = 1, \ldots, q$. One then finds that the information matrix $i(\gamma)$ is
block diagonal in $\psi_1, \ldots, \psi_q, \beta^{(2)}, \ldots, \beta^{(q)}, \mu$ and is given by

$$\operatorname{diag}\big(2\psi_1^{-2}, \ldots, 2\psi_q^{-2}, \psi_2^2 \Sigma_{11}, \ldots, \psi_q^2 \Sigma_{q-1,q-1}, \Sigma^{-1}\big),$$

where $\Sigma_{ii}$ is the submatrix of $\Sigma$ corresponding to the first $i$ components
of $X$. Using the fact that $|\Sigma_{ii}| = \prod_{j=1}^{i} \psi_j^{-2}$, $i = 1, \ldots, q$, we obtain
$|i(\gamma)| = 2^q \prod_{i=1}^{q} \psi_i^{2(2i - q - 1)}$.

Consider the class of priors $\pi^a(\theta) \propto |\Sigma|^{-(q+2-a)/2}$ on $\Theta$, where $a \in \mathbb{R}$. In
the $\gamma$-parameterization, this class becomes
$\pi^a(\gamma) \propto \prod_{i=1}^{q} \psi_i^{2i - q - a - 1}$. Noting that the case $a = 0$ is Jeffreys'
prior, it is straightforward to show from (5.12) that
$L(\gamma, \pi^a) = \frac{q}{2}\{(a - 1)^2 - 1\}$. Therefore, all priors in this class are equalizer
priors and $L$ attains its minimum value within this class when $a = 1$. From invariance
considerations via affine transformations of $X$, it can be shown that these priors
are also equalizer priors for finite $n$ and, hence, are all in the class $U$. These results
therefore suggest that the right Haar prior $\pi_0(\theta) \propto |\Sigma|^{-(q+1)/2}$ arising from the
affine group is minimax. However, in this example it does not appear to be possible
to approximate $\pi_0$ by a sequence of compact priors, as was done in the previous
examples. We conjecture, however, that $\pi_0$ can be approximated by a suitable
sequence of proper priors so that Theorem 6.1 will give the minimaxity of $\pi_0$, but
we have been unable to demonstrate this. This example does show, however, that
Jeffreys' prior is dominated by $\pi_0$.

Interestingly, further analysis reveals that the prior
$\pi_1(\gamma) \propto (\prod_{i=1}^{q} \psi_i)^{-1}$ is also an equalizer prior and that it dominates
$\pi_0$. In the $\theta$-parameterization this prior becomes
$\pi_1(\theta) \propto (\prod_{i=1}^{q} |\Sigma_{ii}|)^{-1}$. However, this prior is seen to be noninvariant
under nonsingular transformation of $X$ and, furthermore, does not satisfy the
boundedness condition (6.3).

In the case $q = 2$, in the parameterization $\phi = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, where
$\sigma_i$ is the standard deviation of $X_i$, $i = 1, 2$, and $\rho = \operatorname{Corr}(X_1, X_2)$,
Jeffreys' prior and $\pi_0$ become, respectively,

$$\pi^J(\phi) \propto \sigma_1^{-2} \sigma_2^{-2} (1 - \rho^2)^{-2},$$

$$\pi_0(\phi) \propto \sigma_1^{-1} \sigma_2^{-1} (1 - \rho^2)^{-3/2}.$$

Therefore (see the paragraph below), $\pi_0$ is Jeffreys' "two-step" prior. In the
context of our predictive set-up, marginalization issues correspond to predicting only
certain functions of the future data $Y = (X_{n+1}, \ldots, X_{n+m})$. In general, the
associated minimax predictive prior will differ from that for the problem of predicting the
entire future data $Y$ unless the selected statistics just form a sufficiency reduction
of $Y$. Such questions will be explored in future work. Thus, if we were only
interested in predicting the correlation coefficient of a future set of bivariate data, then
we might start with the observed correlation as the data $X$ and use Jeffreys' prior
in this single parameter case, which is $\pi^J(\rho) \propto (1 - \rho^2)^{-1}$. For further discussion
and references on the choice of prior in this example, see [6], page 363.

Finally, we note the corresponding result for general $q$ in the case $\mu$ known.
Again, considering the class of priors $\pi^a(\theta) \propto |\Sigma|^{-(q+2-a)/2}$ on $\Theta$, we find that
the optimal choice is $a = 1$, so $\pi_0$ is as given above and in this case coincides with
Jeffreys' prior. This was also shown to be a predictive probability matching prior
in [12] in the case $q = 2$.

Under the conditions of Theorem 6.1, it is possible to change the base measure
from Jeffreys' prior to $\pi_0$, since $\pi_0$ is neutral with respect to $\pi^J$ under $L(\theta, \pi)$.
Denoting quantities with respect to the base measure yro with a zero subscript, 
since L(0, 19) =c < 0 and ¢ Oro) = —c, we have, for x € Too, 


$$L_0(\theta, \pi) = L(\theta, \pi) - L(\theta, \pi_0) = L(\theta, \pi) - c$$

and for $\tau \in \Phi$,

$$\zeta_0(\tau) = \zeta(\tau) + c.$$


Therefore, with respect to the base measure $\pi_0$, the predictive loss under $\pi_0$
becomes $L_0(\theta, \pi_0) = 0$ and the minimum predictive information, attained at
$\pi = \pi_0$, is zero.


7. Discussion. In this paper we have obtained an asymptotic predictive loss 
function that reflects the finite sample size predictive behavior of alternative pri- 
ors when the sample size is large for arbitrary amounts of prediction. This loss 
function is related to that in [14] for the comparison of estimative predictive dis- 
tributions based on Bayes estimators. It can be used to derive nonsubjective priors 
that are impartial, minimax and maximin, which is equivalent here to minimizing 
a measure of the predictive information contained in a prior. In dimensions greater 
than one, unlike an analysis based on prior predictive regret, the maximin prior 
may not be Jeffreys’ prior. A number of examples have been given to illustrate 
these ideas.

As discussed in [23], as model complexity increases, it becomes more difficult 
to make sensible prior assignments, while at the same time the effect of the prior 
specification on the final inference of interest becomes more pronounced. It is 
therefore important to have sound methodology available for the construction and 




implementation of priors in the multiparameter case. We believe that our prelim- 
inary analysis of the posterior predictive regret (2.1) indicates that it should be a 
valuable tool for such an enterprise. More extensive analysis is now required, par- 
ticularly aimed at developing general methods of finding exact and approximate 
solutions for the practical implementation of this work and investigating connec- 
tions with predictive coverage probability bias. Local priors (see, e.g., [23, 24]) 
are expected to play a role. It would also be interesting to develop asymptotically 
impartial minimax posterior predictive loss priors for dependent observations and 
for various classes of nonregular problems. In particular, all the definitions in Sec- 
tion 2 for nonasymptotic settings will apply and could be used to explore predictive 
behavior numerically. 


Acknowledgments. We would like to thank two referees and an Associate 
Editor for their constructive comments and suggestions for improving the clarity 
of this paper. 


REFERENCES 


[1] AITCHISON, J. (1975). Goodness of prediction fit. Biometrika 62 547-554. MR0391353
[2] AKAIKE, H. (1978). A new look at the Bayes procedure. Biometrika 65 53-59. MR0501450
[3] BARRON, A. R. (1999). Information-theoretic characterization of Bayes performance and
the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6
(J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 27-52. Oxford Univ.
Press, New York. MR1723492
[4] BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer,
New York. MR0804611
[5] BERNARDO, J. M. (1979). Reference posterior distributions for Bayesian inference (with
discussion). J. Roy. Statist. Soc. Ser. B 41 113-147. MR0547240
[6] BERNARDO, J. M. and SMITH, A. F. M. (1994). Bayesian Theory. Wiley, Chichester.
MR1274699
[7] CLARKE, B. S. and BARRON, A. R. (1990). Information-theoretic asymptotics of Bayes
methods. IEEE Trans. Inform. Theory 36 453-471. MR1053841
[8] CLARKE, B. S. and BARRON, A. R. (1994). Jeffreys' prior is asymptotically least favorable
under entropy risk. J. Statist. Plann. Inference 41 37-60. MR1292146
[9] CLARKE, B. and YUAN, A. (2004). Partial information reference priors: Derivation and
interpretations. J. Statist. Plann. Inference 123 313-345. MR2062985
[10] COVER, T. M. and THOMAS, J. A. (1991). Elements of Information Theory. Wiley, New York.
MR1122806
[11] DATTA, G. S. and MUKERJEE, R. (2004). Probability Matching Priors: Higher Order
Asymptotics. Lecture Notes in Statist. 178. Springer, New York. MR2053794
[12] DATTA, G. S., MUKERJEE, R., GHOSH, M. and SWEETING, T. J. (2000). Bayesian prediction
with approximate frequentist validity. Ann. Statist. 28 1414-1426. MR1805790
[13] GHOSH, J. K. and MUKERJEE, R. (1991). Characterization of priors under which Bayesian and
frequentist Bartlett corrections are equivalent in the multiparameter case. J. Multivariate
Anal. 38 385-393. MR1131727
[14] HARTIGAN, J. A. (1998). The maximum likelihood prior. Ann. Statist. 26 2083-2103.
MR1700222


[15] KASS, R. E., TIERNEY, L. and KADANE, J. (1990). The validity of posterior expansions based
on Laplace's method. In Bayesian and Likelihood Methods in Statistics and Econometrics
(S. Geisser, J. S. Hodges, S. J. Press and A. Zellner, eds.) 473-488. North-Holland,
Amsterdam.
[16] KOMAKI, F. (1996). On asymptotic properties of predictive distributions. Biometrika 83
299-313. MR1439785
[17] KUBOKI, H. (1998). Reference priors for prediction. J. Statist. Plann. Inference 69 295-317.
MR1631332
[18] LAUD, P. W. and IBRAHIM, J. G. (1995). Predictive model selection. J. Roy. Statist. Soc. Ser. B
57 241-262. MR1325389
[19] LIANG, F. and BARRON, A. R. (2004). Exact minimax strategies for predictive density
estimation, data compression, and model selection. IEEE Trans. Inform. Theory 50 2708-2726.
MR2096988
[20] MCCULLAGH, P. (1987). Tensor Methods in Statistics. Chapman and Hall, London.
MR0907286
[21] SUN, D. and BERGER, J. O. (1998). Reference priors with partial information. Biometrika 85
55-71. MR1627242
[22] SWEETING, T. J. (1996). Approximate Bayesian computation based on signed roots of
log-density ratios (with discussion). In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger,
A. P. Dawid and A. F. M. Smith, eds.) 427-444. Oxford Univ. Press, New York.
MR1425418
[23] SWEETING, T. J. (2001). Coverage probability bias, objective Bayes and the likelihood
principle. Biometrika 88 657-675. MR1859400
[24] SWEETING, T. J. (2005). On the implementation of local probability matching priors for
interest parameters. Biometrika 92 47-57. MR2158609
[25] TIBSHIRANI, R. (1989). Noninformative priors for one parameter of many. Biometrika 76
604-608. MR1040654
[26] XIE, Q. and BARRON, A. R. (1997). Minimax redundancy for the class of memoryless sources.
IEEE Trans. Inform. Theory 43 646-657.


G. S. DATTA
DEPARTMENT OF STATISTICS
UNIVERSITY OF GEORGIA
ATHENS, GEORGIA 30602-1952
USA
E-MAIL: gauri@stat.uga.edu

M. GHOSH
DEPARTMENT OF STATISTICS
UNIVERSITY OF FLORIDA
GAINESVILLE, FLORIDA 32611-8545
USA
E-MAIL: ghoshm@stat.ufl.edu

T. J. SWEETING
DEPARTMENT OF STATISTICAL SCIENCE
UNIVERSITY COLLEGE LONDON
GOWER STREET
LONDON, WC1E 6BT
UNITED KINGDOM
E-MAIL: trevor@stats.ucl.ac.uk


The Annals of Statistics
2006, Vol. 34, No. 1, 469-492
DOI: 10.1214/009053605000000831
© Institute of Mathematical Statistics, 2006


ASYMPTOTIC NORMALITY OF EXTREME
VALUE ESTIMATORS ON C[0, 1]


BY JOHN H. J. EINMAHL AND TAO LIN
Tilburg University and Xiamen University


Consider n i.i.d. random elements on C[0, 1]. We show that, under an
appropriate strengthening of the domain of attraction condition, natural
estimators of the extreme-value index, which is now a continuous function, and
the normalizing functions have a Gaussian process as limiting distribution.
A key tool is the weak convergence of a weighted tail empirical process,
which makes it possible to obtain the results uniformly on [0, 1]. Detailed
examples are also presented.


1. Introduction. Recently considerable progress has been made in the
interesting field of infinite-dimensional extreme value theory, where the data are
(continuous) functions. After the characterization of max-stable stochastic processes in
C[0, 1] by Giné, Hahn and Vatan [11], de Haan and Lin [4, 5] investigated the
domain of attraction conditions and established weak consistency of estimators of the
extreme value index, the centering and standardizing sequences, and the exponent
measure.

Statistics of infinite-dimensional extremes will find various applications, for ex- 
ample, to coast protection (flooding) and risk assessment in finance. For an appli- 
cation to coast protection, consider the northern part of The Netherlands, which 
lies for a substantial part below sea level. Since there is no natural coast defense 
there, the area is protected by a long dike against inundation. Flooding of the dike 
at any place could lead to flooding of the whole area, so the approach via function- 
valued data is the appropriate one here. In finance, the intra-day return of a stock 
is defined as the ratio of the price of a stock at a certain time $t$ during the day to
the price at market opening. This process can be well described, when we mea- 
sure time in days, with a continuous function on [0, 1]. For various risk analysis 
problems (e.g., problems dealing with options), intra-day returns of the stock need 
to be taken into account, instead of just the daily returns (i.e., the function values 
at 1). Sampling on n days puts us in a position to apply statistics of extremes to 
these problems. 

Also, from a mathematical point of view, the research is challenging, because of
the new features of C[0, 1]-valued random elements, when compared to random 


Received February 2002; revised January 2005 

AMS 2000 subject classifications. Primary 62G32, 62G30, 62G05; secondary 60G70, 60F17.

Key words and phrases. Estimation, extreme value index, infinite-dimensional extremes, weak 
convergence on C[0, 1]. 




variables or vectors; in particular, the uniformity in $t \in [0, 1]$ of the results asks for
novel approaches.

It is the purpose of this paper to establish asymptotic normality of estimators
of the extreme value index, which is now an element of C[0, 1], and of the
normalizing sequences. In fact, we will show the asymptotic normality on C[0, 1] of
the estimators under a suitable second-order condition and present all the
limiting processes involved in terms of one underlying Wiener process, which means
that we have the simultaneous weak convergence of all the estimators. The results
are, on the one hand, interesting in themselves, because the extreme value index
measures the tail heaviness of the distribution of the data, and, on the other hand,
the results are a major step toward the estimation of probabilities of rare events in
C[0, 1]; see [7] for a study of this problem in the finite-dimensional case.

In order to be more explicit let us now specify the setup and introduce
notation. Let $\xi_1, \xi_2, \ldots$ be i.i.d. random elements on $C[0, 1]$. Define
$F_t : \mathbb{R} \to [0, 1]$ by $F_t(x) = P\{\xi_i(t) \le x\}$. Throughout assume that

$$P\Big\{\inf_{t \in [0,1]} \xi_i(t) > 0\Big\} = 1 \tag{1}$$

and that

(2) $F_t$ is a continuous and strictly increasing function on its support.

Define

$$U_t(s) = F_t^{-1}(1 - 1/s), \qquad s > 0,\ 0 \le t \le 1.$$

We assume that the domain of attraction condition holds, that is,

$$\Big\{\Big(\max_{1 \le i \le n} \xi_i(t) - b_t(n)\Big) \Big/ a_t(n),\ t \in [0, 1]\Big\} \text{ converges in distribution} \tag{3}$$

on $C[0, 1]$ to a stochastic process, $\eta$ say, with nondegenerate marginals, where
$a_t(n) > 0$ and $b_t(n)$ are continuous (in $t$) normalizing functions, chosen in such a
way that, for each $t$,

$$P\{\eta(t) \le x\} = \exp\big(-(1 + \gamma(t) x)^{-1/\gamma(t)}\big);$$

see [4]. We can and will take $b_t = U_t$. Then $\gamma : [0, 1] \to \mathbb{R}$, the extreme value index
(function), is continuous. Define

$$\zeta_i(t) = \frac{1}{1 - F_t(\xi_i(t))},$$

$$\bar{\eta}(t) = (1 + \gamma(t) \eta(t))^{1/\gamma(t)},$$

and $\nu_s(E) = sP\{\zeta_i \in sE\}$, with $sE = \{sh : h \in E\}$. Clearly the $\zeta_i(t)$ are standard
Pareto random variables, that is, $P\{\zeta_i(t) \le x\} = 1 - 1/x$, $x \ge 1$. It follows from
Theorem 2.8 in [4] that there exists a measure $\nu$ on $C[0, 1]$ that is homogeneous
[i.e., for a Borel set $A$ and $r > 0$, $\nu(rA) = \frac{1}{r}\nu(A)$], such that for any positive
$g \in C[0, 1]$ and compact set $K \subset [0, 1]$,

$$P\{\bar{\eta}(t) < g(t), \text{ for all } t \in K\}
= \exp\big(-\nu(\{f \in C[0, 1] : f(t) \ge g(t), \text{ for some } t \in K\})\big)$$

and

$$\nu_s \to \nu \quad \text{as } s \to \infty, \tag{4}$$

weakly on $S_c := \{f \in C[0, 1] : \sup_{t \in [0,1]} f(t) \ge c\}$ for any $c > 0$, and

$$\frac{U_t(nx) - U_t(n)}{a_t(n)} \to \frac{x^{\gamma(t)} - 1}{\gamma(t)} \quad \text{as } n \to \infty,$$

uniformly in $t \in [0, 1]$ and locally uniformly in $x \in (0, \infty)$.
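
To see the last convergence in a concrete case, the following sketch takes a single marginal with the standard Fréchet distribution function $F(x) = \exp(-1/x)$, for which $\gamma = 1$ and $U(s) = -1/\log(1 - 1/s)$. The scaling $a(n) = n$ is an assumption made for this illustration (other asymptotically equivalent choices would also work), and the function names are hypothetical.

```python
import math


def U(s):
    """Quantile-type function U(s) = F^{-1}(1 - 1/s) for F(x) = exp(-1/x), s > 1."""
    return -1.0 / math.log1p(-1.0 / s)


def normalized_increment(n, x, a_n=None):
    """(U(n x) - U(n)) / a(n), using the assumed scaling a(n) = n by default."""
    a_n = float(n) if a_n is None else a_n
    return (U(n * x) - U(n)) / a_n


gamma = 1.0
for n in (10, 1_000, 100_000):
    for x in (0.5, 2.0, 10.0):
        limit = (x ** gamma - 1.0) / gamma
        print(n, x, normalized_increment(n, x), limit)
```

For growing $n$ the normalized increments approach $(x^{\gamma} - 1)/\gamma = x - 1$ for each of the listed values of $x$.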
Throughout we assume that $k = k(n) \in \{1, \ldots, n - 1\}$ is a sequence of positive
integers satisfying

$$k \to \infty \quad \text{and} \quad \frac{k}{n} \to 0 \quad \text{as } n \to \infty. \tag{5}$$


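Fix $t \in [0, 1]$. Let $\xi_{1,n}(t) \le \xi_{2,n}(t) \le \cdots \le \xi_{n,n}(t)$ be the order statistics of
$\xi_i(t)$, $i = 1, 2, \ldots, n$. We define the following statistical functions:

$$M_n^{(r)}(t) = \frac{1}{k} \sum_{i=0}^{k-1} \big(\log \xi_{n-i,n}(t) - \log \xi_{n-k,n}(t)\big)^r, \qquad r = 1, 2. \tag{6}$$

Set $\gamma^+(t) = \gamma(t) \vee 0$, $\gamma^-(t) = \gamma(t) \wedge 0$ and observe that
$\gamma(t) = \gamma^+(t) + \gamma^-(t)$. Now we define estimators for $\gamma^+(t)$, $\gamma^-(t)$,
$\gamma(t)$, $a_t(n/k)$ and $b_t(n/k)$ as in [9]:

$$\hat{\gamma}_n^+(t) = M_n^{(1)}(t) \quad \text{(Hill estimator)}; \tag{7}$$

$$\hat{\gamma}_n^-(t) = 1 - \frac{1}{2}\left(1 - \frac{(M_n^{(1)}(t))^2}{M_n^{(2)}(t)}\right)^{-1}; \tag{8}$$

$$\hat{\gamma}_n(t) = \hat{\gamma}_n^+(t) + \hat{\gamma}_n^-(t) \quad \text{(moment estimator)}; \tag{9}$$

$$\hat{b}_t\Big(\frac{n}{k}\Big) = \xi_{n-k,n}(t) \quad \text{(location estimator)}; \tag{10}$$

$$\hat{a}_t\Big(\frac{n}{k}\Big) = \xi_{n-k,n}(t)\, \hat{\gamma}_n^+(t) \big(1 - \hat{\gamma}_n^-(t)\big) \quad \text{(scale estimator)}. \tag{11}$$

For fixed $t$ these are well-known one-dimensional estimators. Observe that
$\hat{\gamma}_n^+(t)$ and $\hat{\gamma}_n^-(t)$ are not equal to $(\hat{\gamma}_n)^+(t)$ and $(\hat{\gamma}_n)^-(t)$, respectively.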
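
For a single fixed $t$ the estimators can be computed directly from a sample of positive values. The sketch below is a minimal illustration under stated assumptions: it uses the standard univariate forms, namely the Hill estimator $M_n^{(1)}$, the negative part $1 - \frac{1}{2}(1 - (M_n^{(1)})^2/M_n^{(2)})^{-1}$, their sum as the moment estimator, the order statistic $\xi_{n-k,n}$ as location and $\xi_{n-k,n} M_n^{(1)} (1 - \hat{\gamma}_n^-)$ as scale. The function names, the simulated standard Fréchet sample (for which $\gamma = 1$) and the choices of $n$, $k$ and the seed are all hypothetical.

```python
import math
import random


def moment_statistics(sample, k):
    """Return (M1, M2, xi_{n-k,n}) computed from the k largest order statistics."""
    xs = sorted(sample)
    n = len(xs)
    threshold = xs[n - k - 1]  # xi_{n-k,n} in 1-based order-statistic notation
    logs = [math.log(xs[n - 1 - i]) - math.log(threshold) for i in range(k)]
    m1 = sum(logs) / k
    m2 = sum(v * v for v in logs) / k
    return m1, m2, threshold


def extreme_value_estimates(sample, k):
    """Hill, moment, location and scale estimates for one fixed t."""
    m1, m2, threshold = moment_statistics(sample, k)
    gamma_plus = m1
    gamma_minus = 1.0 - 0.5 / (1.0 - m1 * m1 / m2)
    return {
        "gamma_plus": gamma_plus,
        "gamma_minus": gamma_minus,
        "gamma": gamma_plus + gamma_minus,
        "b": threshold,
        "a": threshold * gamma_plus * (1.0 - gamma_minus),
    }


random.seed(12345)
n, k = 20_000, 500
# Inverse-transform sampling from F(x) = exp(-1/x); the bounds avoid log(0) and log(1).
sample = [-1.0 / math.log(random.uniform(1e-12, 1.0 - 1e-12)) for _ in range(n)]
est = extreme_value_estimates(sample, k)
print(est)
```

In the function-valued setting the same computation would be repeated for each $t$ on a grid, with the same $k$ throughout; the results of this paper concern the behavior of the resulting curves uniformly in $t$.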
The following weak consistency results have been shown in [5]. 




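THEOREM 1.1. If (1), (2), (3) and (5) hold, then

$$\sup_{0 \le t \le 1} \big|\hat{\gamma}_n^+(t) - \gamma^+(t)\big| \overset{P}{\to} 0, \tag{12}$$

$$\sup_{0 \le t \le 1} \big|\hat{\gamma}_n(t) - \gamma(t)\big| \overset{P}{\to} 0, \tag{13}$$

$$\sup_{0 \le t \le 1} \left|\frac{\hat{b}_t(n/k) - b_t(n/k)}{a_t(n/k)}\right| \overset{P}{\to} 0, \tag{14}$$

$$\sup_{0 \le t \le 1} \left|\frac{\hat{a}_t(n/k)}{a_t(n/k)} - 1\right| \overset{P}{\to} 0. \tag{15}$$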








The main results of the paper and examples are presented in Section 2; the 
proofs are deferred to Section 3. 


2. Main results. In this section we present our main result, dealing with the
asymptotic normality of the estimators of which the weak consistency is shown
in Theorem 1.1. In order to establish our main result, we first present a result
that is a key tool for its proof. This result deals with the weak convergence of
a tail empirical process based on the $\zeta_i$, $i = 1, \ldots, n$. Write
$C_{t,x} = \{h \in C[0, 1] : h(t) \ge x\}$ and define

$$S_{n,t}(x) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{\zeta_i(t) \ge x\}} = \frac{1}{n} \sum_{i=1}^{n} 1_{\{\zeta_i \in C_{t,x}\}}.$$

Denote, with $k$ as in (5), the corresponding tail empirical process by

$$w_n(t, x) = \sqrt{k}\left(\frac{n}{k} S_{n,t}\Big(x \frac{n}{k}\Big) - \frac{1}{x}\right).$$

Let $c > 0$ and define $\mathcal{C} = \{C_{t,x} : 0 \le t \le 1, x \ge c\}$. Let $W$ be a zero-mean Gaussian
process defined on $\mathcal{C}$ with $EW(C_{t,x}) W(C_{s,y}) = \nu(C_{t,x} \cap C_{s,y})$. Clearly, for fixed
$t \in [0, 1]$, $\{W(C_{t,1/y}), y \le 1/c\}$ is a standard Wiener process, since
$\nu(C_{t,1/y_1} \cap C_{t,1/y_2}) = \nu(C_{t,1/(y_1 \wedge y_2)}) = y_1 \wedge y_2$. For $\beta > 0$, set, for any
$(t, x), (s, y) \in [0, 1] \times [c, \infty)$,

$$\begin{aligned}
d((t, x), (s, y)) &= \sqrt{E\big(x^\beta W(C_{t,x}) - y^\beta W(C_{s,y})\big)^2} \\
&= \sqrt{(x^\beta - y^\beta)^2 \nu(C_{t,x} \cap C_{s,y}) + x^{2\beta} \nu(C_{t,x} \setminus C_{s,y}) + y^{2\beta} \nu(C_{s,y} \setminus C_{t,x})}.
\end{aligned}$$

Observe that (4) implies that
$\frac{n}{k} P\{\zeta_i(t) \ge \frac{n}{k} x, \zeta_i(s) \ge \frac{n}{k} y\} = \nu_{n/k}(C_{t,x} \cap C_{s,y}) \to
\nu(C_{t,x} \cap C_{s,y})$.




Define, for K > 0, 


KOROR 
h(s) i 


and assume that, for O < B < j and all ci > 0, for some K > 0 and for large 
enough v there exists a dg > 0 such that for all à € (0, do], 


1 \ 7QO*28)/0—25) 
sup £&(f)- v} <c] (10g 7 : 
re[s,s+5] ô 


1 -—3 
Ess = [he C10, 1}:h 0, K (loz 5) forall e [s,s -9]] 





(16) | sup Plg É E; 
$€[0,1] 

For convenient presentation and convenient application in the proofs of the 
main result, this result is presented in an approximation setting (with the random 
elements involved defined on one probability space), via the Skorohod-Dudley-Wichura
construction. So the random elements $w_n$ and $W$ in this theorem are only
equal in distribution to the original ones, but we do not add the usual tildes to the 
notation. 


THEOREM 2.1. Let $0 \le \beta < \frac{1}{2}$. Suppose the conditions (2), (3), (5) and (16)
hold. Then for the new $w_n$ and $W$ mentioned above, we have, for any $c > 0$,

$$\sup_{0 \le t \le 1,\, x \ge c} x^\beta \big|w_n(t, x) - W(C_{t,x})\big| \to 0 \quad \text{as } n \to \infty. \tag{17}$$

Define $Z(t, x) = x^\beta W(C_{t,x})$. Then the process $Z$ is bounded and uniformly
$d$-continuous on $[0, 1] \times [c, \infty)$.


Note that it is well known that, for one fixed $t$, the restriction $\beta < \frac{1}{2}$ is also
necessary for weak convergence of the (one-dimensional) tail empirical process.
So our condition on $\beta$ in the present infinite-dimensional setting is the same as in
dimension one.

Condition (16) is needed to prove tightness. It prevents the continuous random 
function from having extremely large oscillations. From the examples below we 
see that it is a rather weak condition, since they amply satisfy this condition. 

It is important to transform the $\xi_i$ to processes with standard marginals, as we
did by transforming to the $\zeta_i$. Although the choice of standard Pareto marginals is
convenient, it is also reasonable to transform to other marginal distributions, such
as the uniform-(0, 1) distribution. Clearly, uniform-(0, 1) marginals are obtained
by taking $1/\zeta_i$. It is interesting to note and readily checked that the set $E_{s,\delta}$ used
in condition (16) is invariant under this transformation.

It can be useful to replace $|h(t) - h(s)|/h(s)$ by $|\log h(t) - \log h(s)|$ in the
definition of $E_{s,\delta}$ used in condition (16). The version of condition (16) thus
obtained is equivalent to the stated one ($K$ can be different), but it might be easier to
check for certain processes.

We also need the following corollary, which deals with certain quantiles and can 
be obtained by the usual "inversion" from the tail empirical process theorem. 




COROLLARY 2.2. Let « € C[0, 1]. We have, under the conditions of Theo- 
rem 2.1, 


(18) sup 5o as n — oo. 


Oxt xl 


Viento) m" 1) — a(t) W (Cr,1) 








Finally, we present the main result, which gives the asymptotic distributions of
the estimators of $\gamma^+$, $\gamma$, $a$ and $b$ in terms of the process $W$ figuring in
Theorem 2.1.


THEOREM 2.3. Suppose the conditions of Theorem 2.1 and (1) are satisfied 
and the following second-order condition holds: for every t € [O, 1], there exists 
a function A; not changing sign near infinity with liMmy—oo SUPg<;<; | Ar(v)] = 9, 
such that, as v — oo, 


logU;,(vx) -logU;(v) xY 0—1 
a (Eroma GMO Home 


uniformly in t € [O, 1] and locally uniformly in x > 0, with 
X 2 y T 
Ay -(t), p(t) 0) = | yo | uP! du dy, 


with p(t) € [—oo, 0] for all t € [0, 1]. 
If, as n — oo, 




















ii 
: Am c) 
and 
ay (n/k) + | 
k — 0, 
ve vi sip U,(n/ Kk) ED 
then we have 
(22) up IWK(£ (t) — y+) — y* X)? 0, 
<t< 
(23) ‘sup Vk (nt) — y (D) — T (0| > 0, 
(24) sup JU SEGUE) CLR). — U(t) Rs 0, 
0xt«1 a, (n/ k) 
à; (n/ k) P 
nF om 0, 
g en P (n/h) I)-4 e) = 





where P,T, U and A are defined in terms of the process W as follows: 


2o - [^ wo = 
Sy OX l-77 1—y-(t) 
xY 0—1 dx 

y (t xir W 


W (C,.1), 


Qa)-2 | W (Cix) 


-2((1—v-())(1—-2y-()) WC), 

rO ={yt@) -20-v-0)0-2y OPO 
+41- rT OY- 277A aA), 

U(t) = W(C:,1), 

AQ) =V AWC) + 8 -4vy @)(1-y O)PO 
-41-7701 -277 H) QE), t € [0, 1]. 


Condition (19) is a uniform version of one of the natural, well-studied
second-order conditions of univariate extreme value theory; see [3] and [8]. For
$\rho(t) > -\infty$, the absolute value of the function $A_t$ is regularly varying of order
$\rho(t)$ and specifies the rate of convergence in (3). For $\rho(t) = -\infty$, we can choose
$A_t$ such that its absolute value tends to 0 faster than a given power function. Large
values of $|\rho(t)|$ yield fast convergence, whereas small values and, in particular, the
case $\rho(t) = 0$, correspond to (very) slow convergence.

Note that for the case infy¢jo,1] y (t) > 0 and sup, rp 1) 2 (£) < 0, it follows from 
the second-order condition (19) that 


M" 
(26) p pt) (arv)/Ui(v)) —y" Q0) — 
te[0, 1] A;(v) 


So in this case (21) is superfluous, since it follows from (26) and (20). Also, note
that condition (20) can be replaced by the stronger, but easier-to-check condition:
for some $\varepsilon > 0$,





1) +0 as v — oo. 


\[
\sqrt{k} \left( \frac{n}{k} \right)^{\varepsilon + \sup_{t\in[0,1]} \rho(t)} \to 0.
\]

For the case $\sup_{t\in[0,1]} \gamma(t) < 0$ and $\sup_{t\in[0,1]} \rho(t) < 0$, it follows from the second-order
condition (19) that conditions (20) and (21) can be replaced by the stronger
condition: for some $\varepsilon > 0$,

\[
\sqrt{k} \left( \frac{n}{k} \right)^{\varepsilon + \sup_{t\in[0,1]} \rho(t) \,\vee\, \sup_{t\in[0,1]} \gamma(t)} \to 0.
\]

When $\sup_{t\in[0,1]} \rho(t) = 0$ or $\gamma(t_1) = 0$ for some $t_1 \in [0, 1]$ [this also implies
$\rho(t_1) = 0$], we do not have a simple sufficient condition on the growth of $k$, but it
is necessary that $k$ grow more slowly than any power of $n$.
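To see what the power-type growth conditions mean in practice, consider the polynomial choice $k = n^{a}$ with $0 < a < 1$. This choice is an assumption made here for illustration only. Writing $e = \varepsilon + \sup_t \rho(t)$ (or $e = \varepsilon + \sup_t \rho(t) \vee \sup_t \gamma(t)$ in the negative-index case), we get $\sqrt{k}\,(n/k)^{e} = n^{a/2 + (1-a)e}$, which tends to 0 exactly when $a < -2e/(1-2e)$. A small sketch with hypothetical helper names:

```python
def max_growth_exponent(sup_rho, sup_gamma=None, eps=0.0):
    """Largest a such that k = n**a' satisfies the condition for every a' < a.

    e = eps + sup_rho (or eps + max(sup_rho, sup_gamma) when sup_gamma is given).
    Returns 0.0 when e >= 0, where no polynomial rate works.
    """
    e = eps + (sup_rho if sup_gamma is None else max(sup_rho, sup_gamma))
    if e >= 0.0:
        return 0.0
    return -2.0 * e / (1.0 - 2.0 * e)


def normalized_rate(n, a, e):
    """Value of sqrt(k) * (n/k)**e for k = n**a."""
    return n ** (a / 2.0 + (1.0 - a) * e)


if __name__ == "__main__":
    print(max_growth_exponent(-1.0))                  # rho = -1, eps = 0
    print(max_growth_exponent(-1.0, sup_gamma=-0.25))
    print(normalized_rate(1e6, 0.5, -1.0), normalized_rate(1e12, 0.5, -1.0))
```

With $\sup_t \rho(t) = -1$ and $\varepsilon$ close to 0 the threshold is close to $2/3$; as $e$ approaches 0 the threshold shrinks to 0, in line with the remark that $k$ must then grow more slowly than any power of $n$.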




EXAMPLES. In order to illustrate the theory and for a better understanding of 
the conditions in the theorem, we consider two classes of examples. 

Let $f$ be a unimodal, continuous probability density function on the real line
satisfying, for some $K, \delta_0 > 0$ and for all $\delta < \delta_0$,

\[
\sup_{|t-s|\le\delta} \frac{|f(t) - f(s)|}{f(s)} \le K \left( \log\frac{1}{\delta} \right)^{-3}.
\]

This condition is satisfied by, for example, the double exponential density
$((\lambda/2) \exp(-\lambda|x|))$ or the $t$-distribution for any number of degrees of freedom.


Let $(X_j, Y_j)$, $j = 1, 2, \ldots$, be an enumeration of the points of a homogeneous,
rate 1, Poisson process on $\mathbb{R} \times \mathbb{R}_{+}$. Now define

\[
\xi_1(t) = \sup_j \frac{f(t + X_j)}{Y_j}, \qquad t \in [0, 1].
\]

This process is studied in detail in [1]; see also [6]. In particular, $\xi_1$ is a continuous,
stationary, max-stable (i.e., limiting) process with marginals $F_t(x) = P\{\xi_i(t) \le x\} = \exp(-1/x)$, $x > 0$ (for all $t$ and $i$). Observe that $\gamma = 1$ here.

For this process we will only check condition (16) in detail. The other conditions
are easily seen to hold. In particular, (3) holds since the process is max-stable and
(19) is well known to hold for the distribution function $F_t$ in the univariate case,
that is, for fixed $t$. Since $F_t$ does not depend on $t$, it therefore holds uniformly in $t$
and, hence, (19) holds. Note that $\rho = -1$. We now check (16). We have that


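\[
\frac{1}{1 - F_t(x)} = \frac{1}{1 - \exp(-1/x)} =: g(x).
\]

Hence, for large values of $x$ the transformation $g$ is close to the identity. Note that
$g(x) > x$ and $g'(x) < 1$ for $x > 0$. Hence, for $s$, $t$ and $\delta$ as above,

\[
\begin{aligned}
\frac{|g(\xi_1(t)) - g(\xi_1(s))|}{g(\xi_1(s))}
&= \frac{|g(\sup_j f(t + X_j)/Y_j) - g(\sup_j f(s + X_j)/Y_j)|}{g(\sup_j f(s + X_j)/Y_j)} \\
&\le \frac{|\sup_j f(t + X_j)/Y_j - \sup_j f(s + X_j)/Y_j|}{\sup_j f(s + X_j)/Y_j} \\
&\le \frac{\sup_j |f(t + X_j)/Y_j - f(s + X_j)/Y_j|}{\sup_j f(s + X_j)/Y_j} \\
&\le \sup_j \frac{|f(t + X_j)/Y_j - f(s + X_j)/Y_j|}{f(s + X_j)/Y_j} \\
&= \sup_j \frac{|f(t + X_j) - f(s + X_j)|}{f(s + X_j)} \\
&\le K \left( \log\frac{1}{\delta} \right)^{-3}.
\end{aligned}
\]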



So condition (16) is satisfied since the probability involved in the condition is equal 
to 0 for all s € [0, 1]. 
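The two properties of the standardizing transformation used in this example, $g(x) > x$ and $g'(x) < 1$, can be checked numerically. The sketch below uses the closed-form derivative $g'(x) = e^{-1/x} / \bigl(x^2 (1 - e^{-1/x})^2\bigr)$, which is our own computation and not a formula quoted from the paper, and it also illustrates that $g(x) - x$ approaches $1/2$ for large $x$.

```python
import math


def g(x):
    """g(x) = 1 / (1 - exp(-1/x)) for x > 0, computed stably with expm1."""
    return -1.0 / math.expm1(-1.0 / x)


def g_prime(x):
    """Derivative of g: exp(-1/x) / (x^2 * (1 - exp(-1/x))^2)."""
    e = math.exp(-1.0 / x)
    return e / (x * x * math.expm1(-1.0 / x) ** 2)


if __name__ == "__main__":
    for x in (0.05, 0.5, 1.0, 10.0, 1000.0):
        print(x, g(x), g(x) - x, g_prime(x))
```

On this grid $g(x) - x$ is positive and tends towards $1/2$, so $g$ is close to the identity only in the relative sense $g(x)/x \to 1$.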

Let $Y$ be a standard Pareto random variable, that is, $P\{Y \le x\} = 1 - 1/x$ for
$x \ge 1$, and let $B$ be a random element of $C[0, 1]$ such that $B(t) > 0$, $EB(t) = 1$
for all $t \in [0, 1]$, and $E \sup_t B(t) < \infty$. Assume $Y$ and $B$ are independent. Define

\[
\xi_1(t) = Y B(t) \qquad \text{for } t \in [0, 1];
\]

see also [12]. We first show that $\xi_1$ satisfies the domain of attraction condition (3);
more precisely, we show that $\frac{1}{n} \max_{i=1,\ldots,n} \xi_i$ converges in distribution to $\eta$, where
$\xi_1, \ldots, \xi_n$ are i.i.d. We need to show the convergence of the finite-dimensional
distributions and tightness. For the convergence of the finite-dimensional distributions,
let $t_1, \ldots, t_k \in [0, 1]$, $x_1, \ldots, x_k \ge 0$ and $\max_{j=1,\ldots,k} x_j > 0$. Now we have


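\[
\begin{aligned}
&\log P\Bigl\{ \frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(t_1) \le x_1, \ldots, \frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(t_k) \le x_k \Bigr\} \\
&\qquad = n \log P\Bigl\{ \frac{1}{n} Y B(t_1) \le x_1, \ldots, \frac{1}{n} Y B(t_k) \le x_k \Bigr\} \\
&\qquad = n \log\Bigl( 1 - P\Bigl\{ \frac{1}{n} Y B(t_1) > x_1 \text{ or } \ldots \text{ or } \frac{1}{n} Y B(t_k) > x_k \Bigr\} \Bigr) \\
&\qquad \sim -n P\Bigl\{ \frac{1}{n} Y B(t_1) > x_1 \text{ or } \ldots \text{ or } \frac{1}{n} Y B(t_k) > x_k \Bigr\} \\
&\qquad = -n P\Bigl\{ Y > n \min_{1\le j\le k} \frac{x_j}{B(t_j)} \Bigr\} \\
&\qquad = -E\Bigl( \max_{1\le j\le k} \frac{B(t_j)}{x_j} \wedge n \Bigr) \\
&\qquad \to -E \max_{1\le j\le k} \frac{B(t_j)}{x_j}.
\end{aligned}
\]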


This settles the convergence of the finite-dimensional distributions. Note that
for $k = 1$ the last expression is simply $-1/x_1$, which means that again $\gamma = 1$.
Next we consider the tightness. From the derivation above, it follows that
$P\{\frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(0) > M\}$ can be made arbitrarily small for $M$ and $n$ large
enough. So it remains to show that for $\varepsilon > 0$ there exists a $\delta > 0$ such that, for
$n$ large enough,

\[
P\Bigl\{ \sup_{|t-s|\le\delta} \Bigl| \frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(t) - \frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(s) \Bigr| > \varepsilon \Bigr\} \le \varepsilon.
\]







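We have

\[
\begin{aligned}
&\frac{1}{\varepsilon} P\Bigl\{ \sup_{|t-s|\le\delta} \Bigl| \frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(t) - \frac{1}{n} \max_{i=1,\ldots,n} Y_i B_i(s) \Bigr| > \varepsilon \Bigr\} \\
&\qquad \le \frac{1}{\varepsilon} P\Bigl\{ \max_{i=1,\ldots,n} \sup_{|t-s|\le\delta} \Bigl| \frac{1}{n} Y_i B_i(t) - \frac{1}{n} Y_i B_i(s) \Bigr| > \varepsilon \Bigr\} \\
&\qquad \le \frac{n}{\varepsilon} P\Bigl\{ \sup_{|t-s|\le\delta} \Bigl| \frac{1}{n} Y B(t) - \frac{1}{n} Y B(s) \Bigr| > \varepsilon \Bigr\} \\
&\qquad = \frac{n}{\varepsilon} P\Bigl\{ Y > \frac{n\varepsilon}{\sup_{|t-s|\le\delta} |B(t) - B(s)|} \Bigr\} \\
&\qquad \le \frac{n}{\varepsilon} E\Bigl( \frac{\sup_{|t-s|\le\delta} |B(t) - B(s)|}{n\varepsilon} \Bigr) \\
&\qquad = \frac{1}{\varepsilon^2} E \sup_{|t-s|\le\delta} |B(t) - B(s)|.
\end{aligned}
\]

Since $B \in C[0, 1]$, we have $\sup_{|t-s|\le\delta} |B(t) - B(s)| \to 0$. But $\sup_{|t-s|\le\delta} |B(t) - B(s)| \le 2 \sup_t B(t)$ and, by assumption, $E \sup_t B(t) < \infty$, so by Lebesgue's dominated
convergence theorem $E\{\sup_{|t-s|\le\delta} |B(t) - B(s)|\} \to 0$ as $\delta \downarrow 0$. This completes
the proof of the tightness.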

In the sequel we will make the specific choice $B(t) = \exp(W(t) - t/2)$, $t \in [0, 1]$,
with $W$ a standard Wiener process. $B$ is a geometric Brownian motion. Note that
this process satisfies the conditions on $B$ specified in the beginning of this example.
In particular, $E \sup_t B(t) < \infty$ follows from simply bounding $W(t) - t/2$ by $W(t)$
and the fact that the distribution function of $\sup_t W(t)$ is well known to be $2\Phi - 1$,
where $\Phi$ is the standard normal distribution function. The corresponding process
$\xi = Y B$ has been introduced in [12]. It remains to consider (19) and (16) for
this process. It follows from a straightforward calculation that, for every $M > 1$,
uniformly in $t \in [0, 1]$,

\[
\frac{1}{u} \ge P\Bigl\{ Y \exp\Bigl( W(t) - \frac{t}{2} \Bigr) > u \Bigr\} \ge \frac{1}{u} - \frac{1}{u^{M}}
\]

for $u$ large enough. Hence

\[
u \le \frac{1}{1 - F_t(u)} \le \frac{u}{1 - u^{1-M}}
\tag{27}
\]

and for large $v$,

\[
v \ge U_t(v) \ge v\,(1 - v^{1-M}).
\]

Now we consider (19) and note that $(x^{0} - 1)/0$ should be read as usual as $\log x$. We see
that, with $a_t / U_t \equiv 1$,

\[
\log U_t(vx) - \log U_t(v) - \log x \le \log vx - \log v - \log(1 - v^{1-M}) - \log x \le 2 v^{1-M}
\]

and similarly,

\[
-\log U_t(vx) + \log U_t(v) + \log x \le 2 (vx)^{1-M}.
\]


This implies (19) with A; (v) = v^ and p = —oo. 

Finally, we have to show (16), which has to be proved for the transformed
process $\zeta_1$. As we see from (27), for large values of $v$ this transformation is very
close to the identity function. So the transformed and untransformed processes
are very close for high values. Nevertheless, the proof of (16) for the transformed
process is more cumbersome than that for the untransformed process. We therefore
confine ourselves to proving (16) for the untransformed process, since this
proof contains the main ideas. Also, we will use the modified version of $E_{\delta,s}$, as
described below Theorem 2.1, but we will keep the same notation. We have, for
large enough $v$,





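\[
\begin{aligned}
&P\Bigl\{ \xi_1 \notin E_{\delta,s} \Bigm| \sup_{t\in[s,s+\delta]} \xi_1(t) \ge v \Bigr\} \\
&\qquad \le P\Bigl\{ \xi_1 \notin E_{\delta,s},\ \sup_{t\in[s,s+\delta]} \xi_1(t) \ge v \Bigr\} \Big/ P\{\xi_1(s) \ge v\} \\
&\qquad \le 2v\, P\Bigl\{ \xi_1 \notin E_{\delta,s},\ \sup_{t\in[s,s+\delta]} \xi_1(t) \ge v \Bigr\} \\
&\qquad \le 2v\, P\Bigl\{ \xi_1 \notin E_{\delta,s},\ \xi_1(s) > \frac{v}{2} \Bigr\} + 2v\, P\Bigl\{ \sup_{t\in[s,s+\delta]} \xi_1(t) - \xi_1(s) \ge \frac{v}{2} \Bigr\} \\
&\qquad =: D_1 + D_2.
\end{aligned}
\]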


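Consider $D_1$ and use the independent increments property of a Wiener process:

\[
\begin{aligned}
D_1 &= 2v\, P\Bigl\{ \sup_{t\in[s,s+\delta]} \bigl| \log(Y e^{W(t)-t/2}) - \log(Y e^{W(s)-s/2}) \bigr| > K \Bigl( \log\frac{1}{\delta} \Bigr)^{-3},\ Y e^{W(s)-s/2} > \frac{v}{2} \Bigr\} \\
&= 2v\, P\Bigl\{ \sup_{t\in[s,s+\delta]} |W(t) - W(s) - (t/2 - s/2)| > K \Bigl( \log\frac{1}{\delta} \Bigr)^{-3} \Bigr\} \times P\Bigl\{ Y e^{W(s)-s/2} > \frac{v}{2} \Bigr\} \\
&\le 4 P\Bigl\{ \sup_{t\in[s,s+\delta]} |W(t) - W(s)| > \frac{K}{2} \Bigl( \log\frac{1}{\delta} \Bigr)^{-3} \Bigr\} \le \exp\Bigl( -\frac{1}{\sqrt{\delta}} \Bigr),
\end{aligned}
\]

where, for the last inequality, one of the well-known bounds for the oscillations of
the Wiener process is used. For $D_2$ we obtain again, by the independent increments
property,

\[
\begin{aligned}
D_2 &= 2v\, P\Bigl\{ \sup_{t\in[s,s+\delta]} \bigl( Y e^{W(t)-t/2} - Y e^{W(s)-s/2} \bigr) \ge \frac{v}{2} \Bigr\} \\
&\le 2v\, P\Bigl\{ Y e^{W(s)-s/2} \sup_{t\in[s,s+\delta]} \bigl( e^{W(t)-W(s)-(t-s)/2} - 1 \bigr) \ge \frac{v}{2} \Bigr\} \\
&\le 2v\, P\Bigl\{ Y e^{W(s)-s/2} \sup_{y\in[0,1]} \bigl( e^{\sqrt{\delta} V(y)} - 1 \bigr) \ge \frac{v}{2} \Bigr\},
\end{aligned}
\]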


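where $V$ is a standard Wiener process independent of $W$ and $Y$. So the three terms
in the latter probability are independent. Recall that $E e^{W(s)-s/2} = 1$. Now

\[
E \sup_{y\in[0,1]} \bigl( e^{\sqrt{\delta} V(y)} - 1 \bigr) = E\bigl( e^{\sqrt{\delta} \sup_{y\in[0,1]} V(y)} - 1 \bigr) = 2 \int_0^\infty e^{\sqrt{\delta} x} \varphi(x)\, dx - 1,
\]

with $\varphi$ the standard normal density. A straightforward calculation shows that the
latter quantity is equal to $2 e^{\delta/2} (1 - \Phi(-\sqrt{\delta})) - 1 \le \sqrt{\delta}$ for $\delta$ small enough. Hence

\[
\begin{aligned}
D_2 &\le 2v\, P\Bigl\{ Y \ge \frac{v}{2 e^{W(s)-s/2} \sup_{y\in[0,1]} ( e^{\sqrt{\delta} V(y)} - 1 )} \Bigr\} \\
&\le 2v \cdot \frac{2}{v}\, E e^{W(s)-s/2}\, E \sup_{y\in[0,1]} \bigl( e^{\sqrt{\delta} V(y)} - 1 \bigr) \le 4 \cdot 1 \cdot \sqrt{\delta}.
\end{aligned}
\]

So $D_1 + D_2 \le 5\sqrt{\delta}$. This is much smaller than the bound required in (16). Hence,
we have proved that condition.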
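The closed form used for $D_2$ can be checked numerically. The sketch below evaluates $2e^{\delta/2}\Phi(\sqrt{\delta}) - 1$ (which equals $2 e^{\delta/2}(1-\Phi(-\sqrt{\delta})) - 1$ by symmetry), compares it with a midpoint-rule value of $2\int_0^\infty e^{\sqrt{\delta}x}\varphi(x)\,dx - 1$, and compares both with $\sqrt{\delta}$. The truncation of the integral at 12 and the function names are choices made here for illustration.

```python
import math


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_sup_excess(delta):
    """Closed form 2*exp(delta/2)*Phi(sqrt(delta)) - 1."""
    return 2.0 * math.exp(delta / 2.0) * Phi(math.sqrt(delta)) - 1.0


def expected_sup_excess_numeric(delta, upper=12.0, steps=24000):
    """Midpoint rule for 2 * int_0^upper exp(sqrt(delta)*x) * phi(x) dx - 1."""
    h = upper / steps
    root = math.sqrt(delta)
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp(root * x - 0.5 * x * x)
    return 2.0 * total * h / math.sqrt(2.0 * math.pi) - 1.0


if __name__ == "__main__":
    for delta in (1e-4, 1e-2, 0.05, 1.0):
        print(delta, expected_sup_excess(delta), math.sqrt(delta))
```

The comparison with $\sqrt{\delta}$ holds for small $\delta$ but not for $\delta = 1$, so the bound should be read as a small-$\delta$ statement, which is all that condition (16) requires.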


It should be observed that, for both examples, condition (16) trivially remains
satisfied if we transform the process $\xi_1$ by transformations of the marginals by
increasing, continuous functions. So as long as these transformations yield a process
that satisfies the other conditions (including that the transformed process is an
element of $C[0, 1]$), we have a new process for which Theorem 2.3 is valid. In this
way we can obtain processes with many different, and nonconstant, extreme value
index functions.


3. Proofs. 


PROOF OF THEOREM 2.1. We only give a proof for the case $c = 1$; for general
$c > 0$ the proof is similar. For any $\beta \in [0, \frac12)$ define

\[
f_{t,x} := 1_{C_{t,x}}\, x^{\beta}, \qquad
\mathcal{F} = \{ f_{t,x} : 0 \le t \le 1,\ x \ge 1 \}.
\]

Also, define the random measures

\[
Z_{n,i} = \frac{1}{\sqrt{k}}\, \delta_{(k/n)\zeta_i}.
\]

$Z_{n,i}$ is a random function on $\mathcal{F}$ with

\[
Z_{n,i}(f_{t,x}) = \frac{1}{\sqrt{k}}\, f_{t,x}\Bigl( \frac{k}{n} \zeta_i \Bigr).
\]

Then

\[
x^{\beta} w_n(t, x) = \sum_{i=1}^{n} \bigl( Z_{n,i}(f_{t,x}) - E Z_{n,i}(f_{t,x}) \bigr).
\]

First we are going to prove the tightness of $\{\sum_{i=1}^{n} (Z_{n,i}(f) - E Z_{n,i}(f)),\ f \in \mathcal{F}\}$.
We need the following version of Theorem 2.11.9 in [13] (note that, indeed, the
middle condition there is not needed here).


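DEFINITION 3.1. For any $\varepsilon > 0$, the bracketing number $N_{[\,]}(\varepsilon, \mathcal{F}, L_2^n)$ is the
minimal number of sets $N_\varepsilon$ in a partition $\mathcal{F} = \bigcup_{j=1}^{N_\varepsilon} \mathcal{F}_{\varepsilon j}$ of the index set into sets
$\mathcal{F}_{\varepsilon j}$ independent of $n$ such that, for every partitioning set $\mathcal{F}_{\varepsilon j}$,

\[
\sum_{i=1}^{n} E^{*} \sup_{f,g\in\mathcal{F}_{\varepsilon j}} |Z_{n,i}(f) - Z_{n,i}(g)|^2 \le \varepsilon^2.
\tag{28}
\]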


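THEOREM 3.2. For each $n$, let $Z_{n,1}, Z_{n,2}, \ldots, Z_{n,n}$ be independent stochastic
processes with finite second moments indexed by a totally bounded semimetric
space $(\mathcal{F}, d)$. Suppose

\[
\sum_{i=1}^{n} E^{*} \|Z_{n,i}\|_{\mathcal{F}}\, 1_{\{\|Z_{n,i}\|_{\mathcal{F}} > \lambda\}} \to 0 \qquad \text{for every } \lambda > 0,
\tag{29}
\]

where $\|Z_{n,i}\|_{\mathcal{F}} = \sup_{f\in\mathcal{F}} |Z_{n,i}(f)|$ and

\[
\int_0^{\delta_n} \sqrt{\log N_{[\,]}(\varepsilon, \mathcal{F}, L_2^n)}\, d\varepsilon \to 0 \qquad \text{for every } \delta_n \downarrow 0.
\tag{30}
\]

Then the sequence $\sum_{i=1}^{n} (Z_{n,i} - E Z_{n,i})$ is asymptotically tight in $\ell^\infty(\mathcal{F})$ and
converges weakly, provided the finite-dimensional distributions converge weakly.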


We can define $d$ on $\mathcal{F}$ by $d(f_{t,x}, f_{s,y}) = d((t, x), (s, y))$; see the first paragraph
of Section 2. We first show briefly that our class of functions $\mathcal{F}$ is totally bounded
under the metric $d$. We consider w.l.o.g. only the case $x \le y$. Since $\nu$ is a finite, and
hence, tight measure on $\{h \in C[0, 1] : \sup_{t\in[0,1]} h(t) \ge 1\}$, we can, for any $\delta_1 > 0$,
find a $\delta_2 > 0$ such that if $|t - s| < \delta_2$, then


V(Cs, y \ Crx) < v(Cs, \ Cs, 15,2) + (Cs, y 45/2 \ Crx) 
1 1 
< + ~8) < 
y ytdy/2 2 


and (hence), if + — ; < ó,, then 





1 1 
v(C;x \ Cs, y) = v(Cy x \ Cty) a v(Ci, y \ Cs y) = es ag y +ô; < 20. 
Now we have, for |t — s| < 62 and l =i <ô], 


y 
dada 
= (y? — xPY*v(C, x N Cs,y) 4 x"Pv(Ct \ Coy) - y Pv(Cs y \ Crx) 


1 1 
< (yf — xb)? v (Cs y) + x2 (- ^ 251) yy% E ^ à) 
2 
i 1 
s(m(z-2)) ze (nam) ey" (A) 
x y y X y 


28/1 1V 2/1 2p ( | 
<x eee sx e A29] ep y e ND 
x y X Y 


« 81728. 251728 q 8128. 481-28. 


(31) 


So, since 1 — 28 > 0, we see that for e > 0 we can find a 6; > 0 such that 
d(l fex fs,y) < €, for 1— ; < 81 and |t — s| < à. Since obviously F is totally 
bounded under the metric do(fi,x, fs,y) = |$ — 51 + |t — sl, the total boundedness 
under d follows. 

To prove (29), observe 


1 kN 
fedem sup (207) | 


Vk 0<t< 
So 
n 
> E\Znille Lz, te>a} 
i-i 
n k\8 
Er ( sup Br) T asup a & OG CUP] 
n CO 
32 = — | x? dF, (x 
( ) JE (JEP n( ) 




ENERO oo 


n 


OO 
i: (/kX)UP 


where 1 — F,(x) = P(supo-;« & (t)# > x}. Note that P{supg.,<)0(1) 2 x] = 
x^ (Rh € CI[0, 1]: SUPp<;<; A(t) = 1)). Hence it follows from (4) that the func- 
tion x > P{supg<;<; & (t) = x) is regularly varying at infinity with exponent — 1, 
SO 


x^ 1(1— F.(x))dx, 


P [supo 1 5) = ux) = 1 s0 
490 P{supg<;<j & (f) = u} x 


Let 0 < t < 1. Now it immediately follows from Potter's inequality (see, e.g., [2]) 
that, for large n and x > 1, 


(n/ k) P(supg-, «| & (t) = (n/&)x) c1 
(n/ k) P [supo «1 $i (1) 2 (n/k)} 7 


Also, we have as n — oo, 


p| sup TOES If wa ( [recto n: sup fe) 21]) 


k 0<t<1 0xt«l 


> v({ f ecto, 1]: sup ‘f@z1})= ra 


0<1<1 


for some positive, finite C. So for large n and x > 1, 
k t—1 
(33) 1 — Fi(x) < C-x 
n 


Hence, the right-hand side of (32) is bounded from above by 


CkQ8**-D/08), 8-1-0/8 4. BC Jk Ef pbtt2 a, 
(Sex) VB 
cel ZE i0400/8,08-:-0/0D _, ¢ 
1—8-—r , 


for t small enough, since B < 5 L That is (29). 
Next we will prove (30). For any (small) e > 0, let a = eYQ8-D. $ = 
exp(—e-!] and 0 = 1/(1 — K &?). Define 


F(a) = {ftx EF, x >a}, 
Fl, j) = {frx EF, l8 <t < (1+ 1)8,907 xx <07+!). 




Then we have the "partition" F = F (a) U EE. d b log6] gv (I, j). First we 
check (28) for F (a): 


YE eo na — Zn il? 


i] fgeF 


-—nE sup (Zn, (f) i Zna (8) 
F 8EF (a) 


<4nE sup Z2,(f) 
fef(a) ` 


4n 2p 
s a sup 5(0)— z) 1 supo... GG (Dk/nza) 


Q< <| 


us T x^? d Fa (x) 
k Ja 


]—rc 
ed eer 2p--r—1 
End NUT ER 


—-4c—l pbt- 
i—2p—t 


where the last inequality follows from integration by parts and (33). Clearly, the 
latter expression is bounded from above by £? for x (and £) small enough. 
Now we consider (28) for the F (l, j). First note that 


l 4 
(j+1 8 
joa Pan iS Jk lisupis ss qa 5 (t)(k/n)=62}9 d 
] 
a G+) 
i JE Hope tens t (D (k/n)- 6! a EEs) 
1 


+1) 
+ i Sissors b (t) (k/n)z8/ ,C, £5.10 2 


Suppose 2; € Es, and supys<;<(j41)8 5i (t)* > 07. Then for small enough ô, 


sip (t) — ¢, (18) < Ket, (l8), 
là xt «(1--1)8 


and, hence, £, (18)* > 6J-! So 


1 
sup Zni(f) S lt 05/n)201-1, g E 40 ^ 


fEF (hy) k 


l +1)B 
is i Lisupues stes 6 ((k/n)2 0) 6, 4,5}? l i 




Similarly, it can be shown that 


inf Za) zig (18/m»072, t, e£, 40^" 
fe (,)) i wk I p vt 56 
This yields 

n 


) E" sup [Zn (f)— Zn (8) 
i=l) SBE F US) 


2 
< nE*( sup Z ) 
i feF dj) ig ret (1,7) ne 


n il 
S yf, (1k/n)01- g eE) * ^^ 


Q-D£ 
T lisupya ca s 600/0204 5 gE)? 


842 
v Liz, (18)(k/n)>01+2,t,€E, a9") 
n 4] 2 
x: z Elg (18)(k/n)701-,7, EU p - Liz, (18)/n) 202,5, c E, ,)9 P) 
t NT 20+1)2 
T-P sup  4&(tf)2 20,5, € Es 330 
l8 «t «(l--1)8 n 
= T1 + T5. 
We have 


n 1 , 2 
Ti < ;E(9v* P — "NM + 07105542. Q5)(/my-67-1)) 


] 1 
2(j-1)8 "E MEN 2jp m 

< 20 d rri PE 2i zi) 

«gra gc 1 v agi( 1.1 

> 7- l 0J-1  gİi+2 


«3(K& -3K8)x =e, 
and for large n, 


k 
P| um. Oe E, Jo? 0f 
te[I8,(14-1)8) n 





n k 
< =P} sup G(t)— > uela é Ess 
k n 


sp GO 7 |a?#6%F 
tells, (L4-1)8) 


re[16,¢+1)6) 


af < 


1 \ -Q*28)/0—28) 15 
<C-ci(log 5) $38. 


ô 




Hence, we have shown (28). 

It is easy to see that the number of elements of the partition is bounded by 
exp(2/e), which leads to (30). Hence, by Theorem 3.2 we have proved the asymp- 
totic tightness condition. 

It remains to prove that the finite-dimensional distributions of $\sum_{i=1}^{n} (Z_{n,i} - E Z_{n,i})$
converge weakly. This follows from the fact that multivariate weak convergence
follows from weak convergence of linear combinations of the components
and the (univariate) Lindeberg–Feller central limit theorem. It is easily seen that
the Lindeberg condition is fulfilled for the linear combinations, since the $f_{t,x}$ are
made up of indicators and hence bounded.

The fact that Z is bounded and uniformly $d$-continuous follows from the general
theory of weak convergence and properties of Gaussian processes; see Section 1.5
in [13]. □


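PROOF OF COROLLARY 2.2. Write $V_{n,t} = \zeta_{n-k,n}(t)\frac{k}{n}$. We first show the result
for $\alpha = -1$, that is,

\[
\sup_{0\le t\le 1} \Bigl| \sqrt{k} \Bigl( \frac{1}{V_{n,t}} - 1 \Bigr) + W(C_{t,1}) \Bigr| \xrightarrow{P} 0.
\tag{34}
\]

Clearly,

\[
\sup_{0\le t\le 1} \Bigl| \sqrt{k} \Bigl( \frac{1}{V_{n,t}} - 1 \Bigr) + w_n(t, V_{n,t}) \Bigr| \xrightarrow{P} 0,
\]

so (17), with $\beta = 0$, yields

\[
\sup_{0\le t\le 1} \Bigl| \sqrt{k} \Bigl( \frac{1}{V_{n,t}} - 1 \Bigr) + W(C_{t,V_{n,t}}) \Bigr| \xrightarrow{P} 0.
\]

Now by the boundedness and uniform $d$-continuity of $W$ we obtain (34). Finally,
write

\[
\sqrt{k}\bigl( V_{n,t}^{\alpha(t)} - 1 \bigr) = \sqrt{k}\bigl( V_{n,t}^{-1} - 1 \bigr)\, \frac{V_{n,t}^{\alpha(t)} - 1}{V_{n,t}^{-1} - 1}.
\]

Since, by (34),

\[
\sup_{0\le t\le 1} \Bigl| \frac{V_{n,t}^{\alpha(t)} - 1}{V_{n,t}^{-1} - 1} + \alpha(t) \Bigr| \xrightarrow{P} 0,
\]

we obtain, again using (34), (18). □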
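The last step of this proof is a one-dimensional delta-method argument and can be illustrated deterministically. In the sketch below the number `w` stands in for a realized value of $W(C_{t,1})$ and we set $V = 1/(1 - w/\sqrt{k})$, so that $\sqrt{k}(1/V - 1) = -w$ exactly, as (34) suggests in the limit. Both the stand-in `w` and the reading that $\sqrt{k}(V^{\alpha} - 1)$ is then close to $\alpha w$ are assumptions made for illustration; (18) itself is not restated in this section.

```python
import math


def ratio(v, alpha):
    """(v**alpha - 1) / (1/v - 1), which tends to -alpha as v -> 1."""
    return (v ** alpha - 1.0) / (1.0 / v - 1.0)


def scaled_power_deviation(k, w, alpha):
    """sqrt(k) * (V**alpha - 1) for V chosen so that sqrt(k) * (1/V - 1) = -w."""
    v = 1.0 / (1.0 - w / math.sqrt(k))
    return math.sqrt(k) * (v ** alpha - 1.0)


if __name__ == "__main__":
    print(ratio(1.0 + 1e-6, 0.7))                 # close to -0.7
    for k in (1e2, 1e4, 1e8):
        print(k, scaled_power_deviation(k, 0.5, -2.0))  # approaches -1.0
```

For $\alpha = -2$ and $w = 0.5$ the scaled deviation approaches $\alpha w = -1$ as $k$ grows, mirroring the factorization used in the proof.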


PROOF OF THEOREM 2.3. First, from (19) we can prove, for any ¢ > 0, there 
exists Sẹ > 0 such that if v > v, and x > 1, we have, for all O € t <1, 
logU,(vx) -logU,(v) x" © — E) 
———— — ——— | fA (0) ~ H,- (x) 
( a (v)/ Uv) y~(t) [^ LORS 


<e(l +x” 08). 





(35) 




the proof follows along the lines of that for the one-dimensional situation in [3]; 
see also [8]. Inequality (35) implies 


logU;(vx) -logU,(v) x¥ 0—1 
a; (v)/ Ui(v) y (t) 
where C, € (0, co) is a constant. Note that 


(36) <|Ar(v)|(Ce + x^), 





1 k—1 
Mj (t) = x 9 108 Uinn (1) — log Urb (0). 
[zz] 


Hence, we have, for sufficiently large n, 


f Y (Enin (D)/ iik)? © — 1 
k o y (t) 





1* l Ün— i,n (1) à 
- Must [Cs 5 Phiny) 


: Mn. (t) 
E" (Zn—k,n (t))/ U; (Cn—k,n (t)) 


IX: Cui) oa) O 
^k. 2. y- (b 


t JArn—k, OGER: (Eze) ] 


As before, write Vn, = Sn—k,n(t)£. Next 


I (bxat/Giskat) O uel l 
Jk * n—i,n n-—k,n E 
( 2. y (t) P 7] 


efc Bur ars (4) i 


= Viv, © JN em (sz ero dx — f ace dx) 
Wok NK l 


py PO 2 
d vd " wn(t,x)x’ “| dx 
Vat 


(37) 


E o 1. us 
- AK(V, )' es ) f x" 0-2dy + Mk || xY O- dy, 
Vnt Vn.t 




So 
k—1 Y (0. 1 
4 : y^ Cazin fina (0) JS S )- P(t) 
1=0 y (t) ly (t) 
i oo = 
= Var ae | (wn (t, x) — W(C;,x))x” (0-1 dx 
Vat 
= oO d 
ur eua | W(C,)x* €^! dx 
(38) Vn,t 


ast o 
+ (kv 9 =D «y- Wt) | x7 Oax 


n,t 


TC 
4 (vi xY O- dx + WCC) 
Vnt 


Vit " Vat 
= | W(C,.x)x” Onl dx +y (0W(C,1) | YO- gy, 
] 1 
From Theorem 2.1 we obtain, for the first term on the right-hand side in (38), 


sup VM 
t€[0,1] 





oo = 
| (wa (t,x) — W(C, x 0-1 dx 
Vat 





(39) < sup V7 C. sup xP£|w,(t, x) - W (Crx) 
t €[0, 1] telO, 1], x 3 V, 


OO 
x sup y% 9-1-2 dy, 
1 €[0,1] 4 Vas 
Now it follows from Theorem 2.1 with $\beta$ positive (this is crucial) and Corollary 2.2
that the right-hand side of (39) converges to 0 in probability. It readily follows from
Corollary 2.2 that the five other terms on the right-hand side of (38) converge to 0
in probability. So we have


Sup 
Oxt«l 


k—1 l “H 
ve 1 3 niin O/a O P = 1 
k y (0) 





(40) = 5) — Pit) 





For the remainder term of 


Mn? (t) 
at (En—k,n (t))/ U, (Un—k,n (t)) 




in (37), note that we obtain from Lemma 3.2 in [5] that, for 0 < € < 1, 














[S esM 3 
(41) sup |— (2 e) — 5 as n — oo. 
0t Ik 9 En kn) Esse 
It can be derived from the second-order condition (19) and Corollary 2.2 that 
A;(n/k) " B 0. 
O<t<1 At(Cn—k,n (t)) 








Using this in combination with (20) and (41), we see that the remainder term 
in (37) is negligible, so we obtain that 











(1) 
(t) 1 P 
42 [s(n CUN. TR UR 0 
us Zr Crunt)/UsGraa®) 1— i D 
as n — oo. Similarly, 
MP 
JE (t) 
uel «— (Cn—k, n (0)/ U; (Sn—k, n(t)))* 
(43) à 
P 
———————— nIÓ $10) 0 
(—y-@d— 58) e) Ei 


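as $n \to \infty$. Hence, we get

\[
\sup_{0\le t\le 1} \bigl| \sqrt{k}\,(\hat\gamma_n^{-}(t) - \gamma^{-}(t)) - M(t) \bigr| \xrightarrow{P} 0
\tag{44}
\]

as $n \to \infty$, where

\[
M(t) = -2 (1-\gamma^{-}(t))^2 (1-2\gamma^{-}(t))\, P(t) + \tfrac12 (1-\gamma^{-}(t))^2 (1-2\gamma^{-}(t))^2\, Q(t).
\]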
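The coefficients of $P(t)$ and $Q(t)$ in $M(t)$ are what a delta-method linearization would produce. The sketch below assumes the moment-type functional $\gamma^{-} = 1 - \tfrac12 \bigl(1 - m_1^2/m_2\bigr)^{-1}$ of [9], with limiting normalized moments $m_1 = 1/(1-\gamma^{-})$ and $m_2 = 2/((1-\gamma^{-})(1-2\gamma^{-}))$; neither formula is restated in this section, so both are labeled assumptions. It compares the numerical gradient of the functional with the two coefficients above.

```python
def moment_functional(m1, m2):
    """Assumed form of the negative-part estimator: 1 - 0.5 / (1 - m1^2 / m2)."""
    return 1.0 - 0.5 / (1.0 - m1 * m1 / m2)


def limiting_moments(g):
    """Assumed limits of the normalized first and second moments, for g <= 0."""
    return 1.0 / (1.0 - g), 2.0 / ((1.0 - g) * (1.0 - 2.0 * g))


def numeric_gradient(g, h=1e-6):
    """Central-difference partial derivatives of the functional at the limits."""
    m1, m2 = limiting_moments(g)
    d1 = (moment_functional(m1 + h, m2) - moment_functional(m1 - h, m2)) / (2.0 * h)
    d2 = (moment_functional(m1, m2 + h) - moment_functional(m1, m2 - h)) / (2.0 * h)
    return d1, d2


def coefficients_of_M(g):
    """Coefficients of P(t) and Q(t) in M(t) at gamma^-(t) = g."""
    return (-2.0 * (1.0 - g) ** 2 * (1.0 - 2.0 * g),
            0.5 * (1.0 - g) ** 2 * (1.0 - 2.0 * g) ** 2)


if __name__ == "__main__":
    for g in (0.0, -0.5, -2.0):
        print(g, numeric_gradient(g), coefficients_of_M(g))
```

Under these assumptions the gradient matches the pair $\bigl(-2(1-\gamma^{-})^2(1-2\gamma^{-}),\ \tfrac12(1-\gamma^{-})^2(1-2\gamma^{-})^2\bigr)$, which is consistent with the signs in $M(t)$.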
We now prove (22). Write 
VKk(£ (t) — yt) — y * OP(O) 
alnn) (y MPA) 1 
mc A Ak ee E eei 

U, Cran) ( ERT Caan) 1- "x e) 


Ar (Cn—k, n (t)) yt 1 
* ect 4D) ©) y 


at (čn—k,n(t)) ds ) 
Bara LN Good AMEN Pit). 
ü rar á 9) v) 
If we show that 
nk) 4 ) P 
as ee ees O). 










then (22) follows from (42). We have 
at (En -k n (0) d 
Epl oc ec 
"i rwr RAO 
= vE( üt (n/ Kk) n 20) a, (Cn—k,n (t))/ U, (En—k,n (t)) 





U;(n/k) a; (n/k)/U;(n/k) 
Gt (Cn —k n ()/ U; (nk n (t)) k ye) 4+ 
pp c E ep) 
E "s aj (n/ k)/ Un] k) ( ý ex) )y " 


kN* ©) 
eins) — -1)y*o. 
From [8] and [10] it follows that (19) implies 


(ar (xv) / Ur(xv))/(aiQ)/ Ui) -x* gy PO —1 
AQ) p(t) 


as U — OQ, 


(46) 


uniformly in $t \in [0, 1]$ and locally uniformly in $x > 0$. Using (21), (20) and Corollary 2.2,
we indeed obtain (45) and, hence, we have proved (22). Finally, we obtain
(23) from (22) and (44).

For (24), note 


Ü,(n/k) — Ur (n/ k) 
pu E a 
" a; (n/ k) 
a fy 28 Urn ka (0) — log Urtn/ k) 
a; (n/ k)/ Ut (n/ k} 


x (os). (Eze 1) 


ntal) o Eck) — Ui (n/K) ain] E) 
U, (n/ k) a; (n/ k) U;(n/k) i 


From Lemma 3.4 in [5] we obtain 








and 








a, (n/ k) 

















— KO — 0. 
nro qeret |Uz(n/k) ^ 
Combining this with (14) yields 
a amico 
0<t<1 U, (n/ k) 













Hence, 
Es] 
emos) (mem) o mo 


A proof similar to the one leading to (42) shows 
TA log Ur Sn—k,n(t)) — log U; (n/ k) 
Oxrz1 a; (n/ K)/ U,(n/ k) 


So we have obtained (24). 
For (25), we use 


á; (n/ k) 
CS i 7 
| a; (En kn (E) 
= et ee 2 ilbcicasod Loi 
PORT U; (£n—k,n) I =p) Yn ©) ay (n/ k) 
i Qt (En—k,n(t)) 


SARO O E T ea 


P se( Gata B i). 


-uUlo as n — oo. 














a; (n/ k) 
Now 
jur ma 
(ee (ae) a 
+ (erus) - 1} CUM 


From (46) and (20), we know the first term tends to 0 in probability, uniformly in 
t € [0, 1]. Hence Corollary 2.2 and (24) yield 


Qt (En kn (t)) _ _ P 
BC cam a ii 


Using (42), (44) and Theorem 1.1, (25) now follows. □





Acknowledgments. The research of the second named author was performed 
at Erasmus University, Rotterdam, and EURANDOM, Eindhoven. We are grateful 
to Laurens de Haan for stimulating interest during the preparation of the paper and 
for several useful comments. We also thank two referees and the Coeditor for their 
constructive remarks. 




REFERENCES 


[1] BALKEMA, A. A. and DE HAAN, L. (1988). Almost sure continuity of stable moving average
processes with index less than one. Ann. Probab. 16 333–343. MR0920275
[2] BINGHAM, N., GOLDIE, C. and TEUGELS, J. (1987). Regular Variation. Cambridge Univ.
Press. MR0898871
[3] CHENG, S. and JIANG, C. (2001). The Edgeworth expansion for distributions of extreme values.
Sci. China Ser. A 44 427–437. MR1831445
[4] DE HAAN, L. and LIN, T. (2001). On convergence toward an extreme value distribution in
C[0, 1]. Ann. Probab. 29 467–483. MR1825160
[5] DE HAAN, L. and LIN, T. (2003). Weak consistency of extreme value estimators in C[0, 1].
Ann. Statist. 31 1996–2012. MR2036397
[6] DE HAAN, L. and PEREIRA, T. T. (2006). Spatial extremes: Models for the stationary case.
Ann. Statist. 34 146–168.
[7] DE HAAN, L. and SINHA, A. K. (1999). Estimating the probability of a rare event. Ann. Statist.
27 732–759. MR1714710
[8] DE HAAN, L. and STADTMÜLLER, U. (1996). Generalized regular variation of second order.
J. Austral. Math. Soc. Ser. A 61 381–395. MR1420345
[9] DEKKERS, A. L. M., EINMAHL, J. H. J. and DE HAAN, L. (1989). A moment estimator for
the index of an extreme-value distribution. Ann. Statist. 17 1833–1855. MR1026315
[10] DREES, H. (1998). On smooth statistical tail functionals. Scand. J. Statist. 25 187–210.
MR1614276
[11] GINÉ, E., HAHN, M. and VATAN, P. (1990). Max-infinitely divisible and max-stable sample
continuous processes. Probab. Theory Related Fields 87 139–165. MR1080487
[12] GOMES, M. I., DE HAAN, L. and PESTANA, D. (2004). Joint exceedances of the ARCH
process. J. Appl. Probab. 41 919–926. MR2074832
[13] VAN DER VAART, A. W. and WELLNER, J. A. (1996). Weak Convergence and Empirical
Processes. With Applications to Statistics. Springer, New York. MR1385671


DEPARTMENT OF ECONOMETRICS
AND OPERATIONS RESEARCH
TILBURG UNIVERSITY
P.O. BOX 90153
5000 LE TILBURG
THE NETHERLANDS
E-MAIL: jh. einmahl@uvt nl

DEPARTMENT OF MATHEMATICS
XIAMEN UNIVERSITY
POSTCODE 361005
XIAMEN
CHINA
E-MAIL: weimoghult@tom com


The Annals of Statistics
2006, Vol. 34, No. 1, 493–522
DOI 10.1214/009053605000000840
© Institute of Mathematical Statistics, 2006


STABLE LIMITS OF MARTINGALE TRANSFORMS WITH
APPLICATION TO THE ESTIMATION OF GARCH PARAMETERS


BY THOMAS MIKOSCH¹ AND DANIEL STRAUMANN
University of Copenhagen and ETH Zürich


In this paper we study the asymptotic behavior of the Gaussian quasi
maximum likelihood estimator of a stationary GARCH process with heavy-tailed
innovations. This means that the innovations are regularly varying with
index $\alpha \in (2, 4)$. Then, in particular, the marginal distribution of the GARCH
process has infinite fourth moment and standard asymptotic theory with normal
limits and $\sqrt{n}$-rates breaks down. This was recently observed by Hall and
Yao [Econometrica 71 (2003) 285–317]. It is the aim of this paper to indicate
that the limit theory for the parameter estimators in the heavy-tailed case
nevertheless very much parallels the normal asymptotic theory. In the light-tailed
case, the limit theory is based on the CLT for stationary ergodic finite variance
martingale difference sequences. In the heavy-tailed case such a general
result does not exist, but an analogous result with infinite variance stable limits
can be shown to hold under certain mixing conditions which are satisfied
for GARCH processes. It is the aim of the paper to give a general structural
result for infinite variance limits which can also be applied in situations more
general than GARCH.


Received November 2003; revised March 2005.

¹Supported in part by MaPhySto, the Danish Research Network for Mathematical Physics and
Stochastics, DYNSTOCH, a research training network under the Improving Human Potential Programme
financed by the Fifth Framework Programme of the European Commission, and by Danish
Natural Science Research Council (SNF) Grant 21-01-0546.

AMS 2000 subject classifications. Primary 62F12; secondary 62G32, 60E07, 60F05, 60G42,
60G70.

Key words and phrases. GARCH process, Gaussian quasi-maximum likelihood, regular variation,
infinite variance, stable distribution, stochastic recurrence equation, mixing.


1. Introduction. The motivation for writing this paper comes from Gaussian
quasi maximum likelihood estimation (QMLE) for GARCH (generalized autoregressive
conditionally heteroscedastic) processes with regularly varying noise; we
refer to Section 4 for a detailed description of the problem. Recall that the process

\[
X_t = \sigma_t Z_t \quad \text{with } \sigma_t^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i X_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2, \qquad t \in \mathbb{Z},
\tag{1.1}
\]

is said to be a GARCH($p$, $q$) process [GARCH process of order ($p$, $q$)]. Here
$(Z_t)$ is an i.i.d. sequence with $E Z_1^2 = 1$ and $E Z_1 = 0$, and $\alpha_i$, $\beta_j$ are nonnegative
constants. GARCH processes and their parameter estimation have been intensively
investigated over the last few years; see [19] for a general overview and [28] and
the references therein for parameter estimation in GARCH and related models.
In the context of QMLE, the asymptotic behavior of the parameter estimator is
essentially determined by the limiting behavior of the quantity [see (4.13)]

\[
L_n'(\theta_0) = \frac12 \sum_{t=1}^{n} \frac{h_t'(\theta_0)}{\sigma_t^2}\, (Z_t^2 - 1),
\]

where $L_n'$ is the derivative of the underlying log-likelihood, $h_t'$ is the derivative of
$\sigma_t^2$ when considered as a function of the parameter $\theta$, and $\theta_0$ is the true parameter
(consisting of the $\alpha_i$ and $\beta_j$ values) in a certain parameter space. In this context,

\[
G_t = \frac{h_t'(\theta_0)}{\sigma_t^2}, \qquad t \in \mathbb{Z},
\]

is a stationary ergodic sequence of vector-valued random variables which is
adapted to the filtration $\mathcal{F}_{t-1} = \sigma(Y_{t-1}, Y_{t-2}, \ldots)$, $t \in \mathbb{Z}$, where $Y_t = Z_t^2 - 1$ constitutes
an i.i.d. sequence.
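The recursion (1.1) is easy to state in code. The following is a minimal sketch for the GARCH(1, 1) case only, driven by a user-supplied innovation sequence so that it stays deterministic. The function name, the start-up values `x0_sq` and `sigma0_sq`, and the default of starting from the stationary variance $\alpha_0/(1-\alpha_1-\beta_1)$ are assumptions made here for illustration; they are not part of the paper.

```python
import math


def garch11_path(z, alpha0, alpha1, beta1, x0_sq=0.0, sigma0_sq=None):
    """Run X_t = sigma_t * Z_t, sigma_t^2 = alpha0 + alpha1*X_{t-1}^2 + beta1*sigma_{t-1}^2.

    z is the innovation sequence (Z_1, ..., Z_n). Returns (xs, sigma2s, ys),
    where ys[t] = Z_t^2 - 1 is the martingale difference noise Y_t.
    """
    if sigma0_sq is None:
        # Hypothetical start-up choice: stationary variance, needs alpha1 + beta1 < 1.
        sigma0_sq = alpha0 / (1.0 - alpha1 - beta1)
    x_sq_prev, s_prev = x0_sq, sigma0_sq
    xs, sigma2s, ys = [], [], []
    for zt in z:
        s = alpha0 + alpha1 * x_sq_prev + beta1 * s_prev
        x = math.sqrt(s) * zt
        xs.append(x)
        sigma2s.append(s)
        ys.append(zt * zt - 1.0)
        x_sq_prev, s_prev = x * x, s
    return xs, sigma2s, ys


if __name__ == "__main__":
    xs, sigma2s, ys = garch11_path([1.0, -1.0, 2.0], 0.1, 0.2, 0.7,
                                   x0_sq=1.0, sigma0_sq=1.0)
    print(xs, sigma2s, ys)
```

Note that $\sigma_t^2$ depends only on innovations up to time $t-1$, which is the measurability property that makes $G_t Y_t$ a martingale transform.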

If $G_t$ has a finite first moment, the sequence $(G_t Y_t)$ is a transform of the martingale
difference sequence $(Y_t)$, hence, a stationary ergodic martingale difference
sequence with respect to $(\mathcal{F}_t)$. If $E|G_1|^2 < \infty$ and $E Y_1^2 < \infty$, an application of
the central limit theorem (CLT) for finite variance stationary ergodic martingale
differences (see [4], Theorem 23.1) yields

\[
n^{-1/2} \sum_{t=1}^{n} G_t Y_t \xrightarrow{d} N(0, \Sigma),
\]

where $\Sigma$ is the covariance matrix of $G_1 Y_1$. This result does not require any additional
information about the dependence structure of $(G_t Y_t)$. It implies the asymptotic
normality of the parameter estimator based on QMLE.

If $E Y_1^2 = \infty$, a result as general as the CLT for stationary ergodic martingale
differences is not known. However, some limit results for stationary sequences
with marginal distribution in the domain of attraction of an infinite variance stable
distribution exist. We recall two of them in Section 2. Our interest in infinite variance
stable limit distributions for $\sum_{t=1}^{n} G_t Y_t$ is again closely related to parameter
estimation for GARCH processes. Recently, Hall and Yao [16] gave the asymptotic
theory for QMLE in GARCH models when $E Z_1^4 = \infty$. To be more specific,
they assume regular variation with index $\alpha \in (1, 2)$ for the distribution of $Z_1^2$. It is
our aim to show that their results can be obtained by a general limit result for the
martingale transforms $\sum_{t=1}^{n} G_t Y_t$ when the i.i.d. noise $(Y_t)$ is regularly varying
with index $\alpha \in (1, 2)$. The key notions in this context are regular variation of the
finite-dimensional distributions of $(G_t Y_t)$ and strong mixing of this sequence; see
Section 2 for these notions.

Our objective is twofold. First, we want to show that the theories on parameter
estimation for GARCH processes with heavy- or light-tailed innovations $(Z_t)$ parallel
each other. We use the recent structural approach to GARCH estimation by
Berkes et al. [3] in order to show that such a unified approach is possible. Second,
our approach to the asymptotic theory for parameter estimators is not restricted
to GARCH processes. In the light-tailed case, Straumann and Mikosch [28] extended
the approach by Berkes et al. [3], including among others AGARCH and
EGARCH processes. The main difficulty of our approach when infinite variance
limits occur is the verification of certain mixing conditions. In contrast to the case
of asymptotic normality, such conditions cannot be avoided. However, it is difficult
to check for a given model that these conditions hold; see Section 4.4 in order to
get a flavor of the task to be solved.

GARCH processes and their parameter estimation give the motivation for this
paper. The corresponding limit theory for the QMLE with heavy-tailed innovations
can be found in Section 4. Our main tool for achieving these limit results is
based on asymptotic theory for martingale transforms with infinite variance stable
limits. This theory is formulated and proved in Section 3. It is based on more general
results for sums of stationary mixing vector sequences with regularly varying
finite-dimensional distributions. This theory is outlined in Section 2.


2. Preliminaries. In this section we collect some basic tools and notions to
be used throughout this paper. First we want to formulate a classical result on infinite
variance stable limits for i.i.d. vector-valued summands due to Rvačeva [25].
Before we formulate this result, we recall the notions of stable random vector and
multivariate regular variation. The class of stable random vectors coincides with
the class of possible limit distributions for sums of i.i.d. random vectors, and multivariate
regular variation is the domain of attraction condition for sums of i.i.d.
random vectors. Then we continue with an analog of Rvačeva's result for stationary
ergodic vector sequences. In this context, we also need to recall some mixing
conditions.

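Stable random vectors. Recall that a vector $X$ with values in $\mathbb{R}^d$ is said to be
$\alpha$-stable for some $\alpha \in (0, 2)$ if its characteristic function is given by

\[
E e^{i(x, X)} =
\begin{cases}
\exp\Bigl\{ -\displaystyle\int_{S^{d-1}} |(x, y)|^{\alpha} \bigl( 1 - i\, \mathrm{sign}((x, y)) \tan(\pi\alpha/2) \bigr)\, \Gamma(dy) + i(x, \mu) \Bigr\}, & \alpha \ne 1, \\[2ex]
\exp\Bigl\{ -\displaystyle\int_{S^{d-1}} |(x, y)| \Bigl( 1 + i\, \frac{2}{\pi}\, \mathrm{sign}((x, y)) \log|(x, y)| \Bigr)\, \Gamma(dy) + i(x, \mu) \Bigr\}, & \alpha = 1,
\end{cases}
\tag{2.1}
\]

where $(x, y)$ denotes the usual inner product in $\mathbb{R}^d$ and $|\cdot|$ the Euclidean norm;
see [27], Theorem 2.3.1. The index of stability $\alpha \in (0, 2)$, the spectral measure
$\Gamma$ on the unit sphere $S^{d-1}$ and the location parameter $\mu$ uniquely determine the
distribution of an infinite variance $\alpha$-stable random vector $X$.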


496 T. MIKOSCH AND D. STRAUMANN 


Multivariate regular variation. If $X$ is $\alpha$-stable for some $\alpha \in (0, 2)$, it is
regularly varying with index $\alpha$. This means the following. The random vector $X$ with
values in $\mathbb{R}^d$ is regularly varying with index $\alpha > 0$ if there exists a random vector
$\Theta$ with values in the unit sphere $S^{d-1}$ of $\mathbb{R}^d$ such that for any $t > 0$, as $x \to \infty$,

\[
\frac{P(|X| > tx,\ \tilde{X} \in \cdot)}{P(|X| > x)} \xrightarrow{v} t^{-\alpha} P(\Theta \in \cdot),
\tag{2.2}
\]

where for any vector $x \ne 0$,

\[
\tilde{x} = x/|x|,
\]

and $\xrightarrow{v}$ denotes vague convergence in the Borel $\sigma$-field of $S^{d-1}$; see [22, 23] for
its definition and details. The distribution of $\Theta$ is called the spectral measure of $X$.
Alternatively, (2.2) is equivalent to

\[
\frac{P(X \in x\,\cdot)}{P(|X| > x)} \xrightarrow{v} \mu(\cdot),
\tag{2.3}
\]

where $\xrightarrow{v}$ denotes vague convergence in the Borel $\sigma$-field of $\overline{\mathbb{R}}^d \setminus \{0\}$ and $\mu$ is
a measure on the same $\sigma$-field satisfying the homogeneity assumption $\mu(tA) = t^{-\alpha}\mu(A)$
for $t > 0$.


REMARK 2.1. The property of regular variation of $X$ with index $\alpha$ does not
depend on the chosen norm. However, the spectral measure (the unit spheres $S^{d-1}$
depend on the norm) and the limiting measure $\mu$ can be different for distinct norms.
The asymptotic theory of this paper does not depend on the particular choice of
the norm $|\cdot|$. Unless specified otherwise, we will, however, assume that $|\cdot|$ is the
Euclidean norm.


To give some intuition on regular variation of a vector $X$, we mention some
immediate consequences of the definition. Regular variation of $X$ implies that $|X|$
is regularly varying: $P(|X| > x) = L(x)x^{-\alpha}$, where $L(x)$ is slowly varying in the
sense that $L(cx)/L(x) \to 1$ as $x \to \infty$, for every $c > 0$. This property follows
by plugging the set $S^{d-1}$ into (2.2). Moreover, relation (2.3) implies that every
linear combination $(a, X)$, $a \ne 0$, of the components of $X$ is regularly varying
with the same index $\alpha$. This follows by plugging the $d$-dimensional halfspace
$\{x \in \mathbb{R}^d : (a, x) > 1\}$ into (2.3).
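
The scaling property of regularly varying tails can be checked numerically. The following is a minimal sketch, not part of the original argument: the two tail functions (a pure Pareto tail and a tail with the slowly varying factor $L(x) = 1/(1+\log x)$) are hypothetical examples chosen for illustration, and the function names are assumptions.

```python
import math

def pareto_tail(x, alpha):
    """P(|X| > x) for a pure power tail x**(-alpha) on x >= 1 (illustrative choice)."""
    return x ** (-alpha) if x >= 1.0 else 1.0

def log_perturbed_tail(x, alpha):
    """Tail L(x) * x**(-alpha) with slowly varying L(x) = 1 / (1 + log x), x >= 1."""
    return x ** (-alpha) / (1.0 + math.log(x)) if x >= 1.0 else 1.0

def tail_ratio(tail, t, x, alpha):
    """Ratio P(|X| > t x) / P(|X| > x); regular variation says it tends to t**(-alpha)."""
    return tail(t * x, alpha) / tail(x, alpha)

alpha, t = 1.5, 2.0
limit = t ** (-alpha)
for x in (1e2, 1e4, 1e8, 1e16):
    # The Pareto ratio equals the limit for every x; the perturbed one approaches it slowly.
    print(x, tail_ratio(pareto_tail, t, x, alpha),
          tail_ratio(log_perturbed_tail, t, x, alpha), limit)
```

The slow approach in the second column illustrates why the slowly varying factor matters in finite samples even though it disappears in the limit.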

Definition (2.2) has an equivalent sequential analog in the following sense.
Choosing any sequence $a_n \to \infty$ such that

\[
nP(|X| > a_n) \to 1,
\tag{2.4}
\]

(2.2) is equivalent to

\[
nP(|X| > t a_n,\ \tilde{X} \in S) \to t^{-\alpha} P(\Theta \in S), \qquad t > 0,
\tag{2.5}
\]


STABLE LIMITS AND GARCH 497 


for all Borel sets $S \subset S^{d-1}$ with $P(\Theta \in \partial S) = 0$. By an application of Poisson's
limit theorem, the latter relation implies for an i.i.d. sequence $(X_t)$ with the same
marginal distribution as $X$ that the binomial random variable

\[
N_n((t, \infty) \times S) = \sum_{i=1}^{n} I_{(t,\infty)\times S}\big(a_n^{-1}|X_i|,\ \tilde{X}_i\big) \xrightarrow{d} N((t, \infty) \times S),
\tag{2.6}
\]

where the limiting variable is Poisson with parameter $t^{-\alpha} P(\Theta \in S)$ and $I_A$
denotes the indicator function of $A$. This binomial variable counts those exceedances
of the scaled lengths $a_n^{-1}|X_1|, \ldots, a_n^{-1}|X_n|$ of the vectors $X_t$ above the
threshold $t$ for which the angles of the $X_t$'s fall into the set $S$. The distributional
convergence (2.6) can be extended to the weak convergence of the underlying point
processes $N_n$ toward a Poisson process $N$ on $\overline{\mathbb{R}}^d \setminus \{0\}$, $\mu$ being its mean measure;
we omit the details and refer again to the mentioned literature [22, 23]. However,
the limit relation (2.6) already explains to some extent what the spectral measure
describes (in an asymptotic sense): it gives the likelihood that the angles of the
i.i.d. regularly varying vectors $X_1, \ldots, X_n$ "far away from the origin" fall into a
specified set $S$.

The Poisson convergence result (2.6) also tells us what "far away from the
origin" means: the scaling $a_n$ of the $X_t$'s has to be chosen according to the
condition (2.4). We see in the sequel that this condition will appear in various disguises.
Finally, we mention that (2.3) can be written in equivalent sequential form with
$(a_n)$ satisfying (2.4) as

\[
nP(a_n^{-1} X \in \cdot) \xrightarrow{v} \mu(\cdot).
\]
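
The role of the normalization (2.4) in the Poisson limit (2.6) can be made concrete in the simplest case. The sketch below is an illustration under stated assumptions: it takes a pure Pareto tail $P(|X| > x) = x^{-\alpha}$, $x \ge 1$, for which $a_n = n^{1/\alpha}$ solves (2.4) exactly, and it ignores the angular component (that is, $S$ is the whole sphere). The function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial(n, p) probability of exactly k successes."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson(lam) probability of the value k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def exceedance_pmfs(n, t, alpha, kmax=5):
    """Pairs (binomial, Poisson) for the number of |X_i| exceeding t * a_n.

    Assumes the Pareto tail P(|X| > x) = x**(-alpha) for x >= 1, so that
    a_n = n**(1/alpha) gives n * P(|X| > a_n) = 1, as required by (2.4).
    """
    a_n = n ** (1.0 / alpha)
    p = (t * a_n) ** (-alpha)      # equals t**(-alpha) / n
    lam = t ** (-alpha)            # Poisson parameter in (2.6) with S the whole sphere
    return [(binom_pmf(k, n, p), poisson_pmf(k, lam)) for k in range(kmax + 1)]

for k, (b, q) in enumerate(exceedance_pmfs(10**4, 0.8, 1.5)):
    print(k, round(b, 6), round(q, 6))
```

With any other growth rate for $a_n$ the expected number of exceedances would tend to zero or infinity, which is why (2.4) fixes the scale.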


Stable limits for sums of i.i.d. random vectors. Now let $(Y_t)$ be an i.i.d.
sequence of random vectors with values in $\mathbb{R}^d$. According to Rvačeva [25], there
exist sequences of constants $a_n > 0$ and $b_n \in \mathbb{R}^d$ such that

\[
a_n^{-1} \sum_{t=1}^{n} Y_t - b_n \xrightarrow{d} X_\alpha
\]

for some $\alpha$-stable random vector $X_\alpha$ with $\alpha \in (0, 2)$ if and only if $Y_1$ is regularly
varying with index $\alpha$, and the normalizing constants $a_n$ can be chosen as

\[
P(|Y_1| > a_n) \sim n^{-1}.
\tag{2.7}
\]

Notice that (2.7) is directly comparable with condition (2.4), which appears in the
sequential definition of regular variation.

For a stationary sequence $(Y_t)$, a similar result can be found in [13] as a
multivariate extension of one-dimensional results in [12]. For its formulation one needs
regular variation of the summands and a particular mixing condition, called $\mathcal{A}(a_n)$,
which was introduced in [12].




Mixing conditions. We say that the condition $\mathcal{A}(a_n)$ holds for the stationary
sequence $(Y_t)$ of random vectors with values in $\mathbb{R}^d$ if there exists a sequence of
positive integers $r_n$ such that $r_n \to \infty$, $k_n = [n/r_n] \to \infty$ as $n \to \infty$ and

\[
E \exp\Big\{-\sum_{t=1}^{n} f(Y_t/a_n)\Big\} - \Big(E \exp\Big\{-\sum_{t=1}^{r_n} f(Y_t/a_n)\Big\}\Big)^{k_n} \to 0,
\qquad n \to \infty, \quad \forall f \in \mathcal{G}_s,
\tag{2.8}
\]

where $\mathcal{G}_s$ is the collection of bounded nonnegative step functions on $\overline{\mathbb{R}}^d \setminus \{0\}$. The
convergence in (2.8) is not required to be uniform in $f$. This is indeed a very
weak condition and is implied by many known mixing conditions, in particular,
the strong mixing condition which is relevant in the context of GARCH processes;
see Section 4. We refer to [13] for a comparison of $\mathcal{A}(a_n)$ with other mixing
conditions.

For later use we also recall the definition of a strongly mixing stationary
sequence $(Y_t)$ of random vectors with rate function $(\phi_k)$ (see [24], cf. [14] or [17]):

\[
\sup_{A \in \sigma(Y_s,\, s \le 0),\ B \in \sigma(Y_s,\, s > k)} |P(A \cap B) - P(A)P(B)| =: \phi_k \to 0 \qquad \text{as } k \to \infty.
\]

If $(\phi_k)$ decays to zero at an exponential rate, then $(Y_t)$ is said to be strongly mixing
with geometric rate. In Section 4.4 we use a more stringent notion of mixing,
called $\beta$-mixing or absolute regularity. It implies strong mixing with the same rate
function.


Stable limits for sums of stationary random variables. The following result is a
combination of Theorem 2.8 and Proposition 3.3 in [13]. It gives conditions under
which an $\alpha$-stable weak limit occurs for the sum process of a stationary sequence.
In what follows we write

\[
S_0 = 0 \quad \text{and} \quad S_n = Y_1 + \cdots + Y_n, \qquad n \ge 1,
\]

and for any Borel set $B \subset \mathbb{R}$,

\[
S_n B = \big(S_n^{(i)}(B)\big)_{i=1,\ldots,d},
\]

where

\[
S_n^{(i)}(B) = \sum_{t=1}^{n} Y_t^{(i)} I_B\big(|Y_t^{(i)}|/a_n\big), \qquad n \ge 1.
\]


THEOREM 2.2. Let $(Y_t)$ be a strictly stationary sequence of random vectors
with values in $\mathbb{R}^d$ and the real sequence $(a_n)$ be defined by (2.7). Assume that the
following conditions are satisfied:




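(a) The finite-dimensional distributions of $(Y_t)$ are regularly varying with index
$\alpha > 0$. To be specific, let $\mathrm{vec}(\theta_{-k}^{(k)}, \ldots, \theta_{k}^{(k)})$ be the $(2k + 1)d$-dimensional
random row vector with values in the unit sphere $S^{(2k+1)d-1}$ that appears in the
definition (2.2) of regular variation of $\mathrm{vec}(Y_{-k}, \ldots, Y_k)$, $k \ge 0$, with respect
to the max-norm $|\cdot|$ in $\mathbb{R}^{(2k+1)d}$.

(b) The mixing condition $\mathcal{A}(a_n)$ holds for $(Y_t)$.

(c)

\[
\lim_{k \to \infty} \limsup_{n \to \infty} P\Big(\bigvee_{k \le |t| \le r_n} |Y_t| > a_n y \,\Big|\, |Y_0| > a_n y\Big) = 0, \qquad y > 0,
\tag{2.9}
\]

where $(r_n)$ appears in the formulation of $\mathcal{A}(a_n)$.

Then the limit

\[
\gamma = \lim_{k \to \infty} E\Big(|\theta_0^{(k)}|^{\alpha} - \bigvee_{j=1}^{k} |\theta_j^{(k)}|^{\alpha}\Big)_{+} \Big/ E|\theta_0^{(k)}|^{\alpha}
\tag{2.10}
\]

exists. If $\gamma > 0$, then the following results hold:

(i) If $\alpha \in (0, 1)$, then

\[
a_n^{-1} S_n \xrightarrow{d} X_\alpha,
\]

for some $\alpha$-stable random vector $X_\alpha$.

(ii) If $\alpha \in [1, 2)$, and for all $\delta > 0$

\[
\lim_{y \downarrow 0} \limsup_{n \to \infty} P\big(|S_n(0, y] - ES_n(0, y]| > \delta a_n\big) = 0,
\tag{2.11}
\]

then

\[
a_n^{-1}\big(S_n - ES_n(0, 1]\big) \xrightarrow{d} X_\alpha,
\]

for some $\alpha$-stable random vector $X_\alpha$.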


REMARK 2.3. The structure of the limiting vectors $X_\alpha$ is given by some
functional of the points of a limiting point process. The proof of this result makes heavy
use of point process convergence results, which are appropriate tools in the context
of regularly varying distributions when extremely large values may occur in the
sequence $(Y_t)$; see [13] for details. This leaves the parameters in the characteristic
function (2.1) unspecified (with the exception of $\alpha$); a specification is not available
so far and requires further investigation.


REMARK 2.4. The quantity $\gamma$ in (2.10) can be identified as the extremal index
of the sequence $(|Y_t|)$; see [12] and Remark 2.3 in [13]. The extremal index $\gamma \in
[0, 1]$ of a strictly stationary real-valued sequence is a number which characterizes




the clustering behavior of the sequence above high thresholds. Roughly speaking,
its existence ensures that the approximate relationship

\[
P\Big(\max_{t=1,\ldots,n} |Y_t| \le u_n\Big) \approx P^{n\gamma}\big(|Y_1| \le u_n\big)
\]

holds for suitable sequences $u_n \to \infty$. For the definition and interpretation of the
extremal index, we refer to [18] and [15], Section 8.1. The case $\gamma = 0$ corresponds
to the case of sequences with unusually large cluster sizes above high thresholds.
This case is often considered pathological; see [18] for some examples and the
recent paper by Samorodnitsky [26]. For $\gamma = 0$ the limit theory developed in [12, 13]
yields that the weak limit results in the above theorem hold with zero limit.
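
To see what the approximate relationship says in a case where everything is explicit, one can look at a moving-maximum sequence. The sketch below uses an example that is not taken from the paper: $X_t = \max(Z_{t-1}, Z_t)$ with i.i.d. Fréchet innovations $Z_t$, $P(Z_t \le u) = \exp(-u^{-\alpha})$. For this sequence $P(\max_{t \le n} X_t \le u) = P(Z_0 \le u)^{n+1}$ and $P(X_1 \le u) = P(Z_0 \le u)^2$, so the exponent in the relationship can be computed exactly; the function names are hypothetical.

```python
import math

def frechet_cdf(u, alpha):
    """P(Z <= u) = exp(-u**(-alpha)) for a standard Frechet variable, u > 0."""
    return math.exp(-u ** (-alpha))

def implied_extremal_index(n, u, alpha):
    """Solve P(max_{t<=n} X_t <= u) = P(X_1 <= u)**(n * gamma) for gamma.

    Model assumption (illustrative): X_t = max(Z_{t-1}, Z_t) with i.i.d. Frechet Z_t,
    so the maximum of X_1..X_n is the maximum of the n + 1 variables Z_0..Z_n.
    """
    log_f = math.log(frechet_cdf(u, alpha))
    log_p_max = (n + 1) * log_f     # log P(max(Z_0, ..., Z_n) <= u)
    log_p_one = 2 * log_f           # log P(max(Z_0, Z_1) <= u)
    return log_p_max / (n * log_p_one)

alpha = 1.5
for n in (10, 100, 10**4):
    u_n = n ** (1.0 / alpha)        # threshold on the scale of the partial maxima
    print(n, implied_extremal_index(n, u_n, alpha))
```

The values approach 1/2: each large innovation produces a cluster of two exceedances, which halves the effective number of independent exceedances.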


3. Stable limits for martingale transforms. In this section we want to derive
infinite variance stable limits for sums of strictly stationary random vectors which
have the particular form

\[
\mathbf{Y}_t = \mathbf{G}_t Y_t,
\]

where $(Y_t)$ is an i.i.d. sequence and $(\mathbf{G}_t)$ is a strictly stationary sequence of
random vectors with values in $\mathbb{R}^d$ such that $(\mathbf{G}_t)$ is adapted to the filtration given
by the $\sigma$-fields $\mathcal{F}_t = \sigma(Y_{t-1}, Y_{t-2}, \ldots)$, $t \in \mathbb{Z}$. If $EY_1 = 0$ and $E|\mathbf{G}_1| < \infty$, then
$E(\mathbf{G}_t Y_t \mid \mathcal{F}_t) = 0$ a.s., and, therefore, $(\mathbf{G}_t Y_t)$ is a martingale difference sequence
and

\[
S_0 = 0, \qquad S_n = \mathbf{Y}_1 + \cdots + \mathbf{Y}_n, \quad n \ge 1,
\]

is the martingale transform of the martingale $(\sum_{t=1}^{n} Y_t)_{n \ge 0}$ by the sequence $(\mathbf{G}_t)$.
We keep this name even if $E|Y_1| = \infty$.

3.1. Basic assumptions. We impose the following assumptions on the
sequences $(Y_t)$ and $(\mathbf{G}_t)$:

A.1. $Y_1$ is regularly varying with index $\alpha \in (0, 2)$.

A.2. $E|\mathbf{G}_1|^{\alpha+\epsilon} < \infty$ for some $\epsilon > 0$.

A.3. $(\mathbf{G}_t Y_t)$ satisfies condition $\mathcal{A}(a_n)$ [see (2.8)], where $P(|Y_1| > a_n) \sim n^{-1}$ and
$(r_n)$, defined in (2.8), is such that

\[
n r_n \Big(\frac{a_{r_n}}{a_n}\Big)^{\alpha+\epsilon} \to 0,
\tag{3.1}
\]

where $\epsilon$ is the same as in A.2.


REMARK 3.1. Regular variation of $Y_1$ with index $\alpha$ and the i.i.d. property of
$(Y_t)$ imply that

\[
P\Big(a_n^{-1} \max_{1 \le t \le n} |Y_t| \le x\Big) \to \Phi_\alpha(x) = e^{-x^{-\alpha}}, \qquad x > 0,
\]

for the Fréchet distribution $\Phi_\alpha$; see [15], Chapter 3.
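
The convergence in Remark 3.1 can be verified directly when the tail is an exact power. The following minimal sketch assumes the illustrative Pareto tail $P(|Y_1| > x) = x^{-\alpha}$, $x \ge 1$, for which $a_n = n^{1/\alpha}$; the function names are hypothetical.

```python
import math

def scaled_max_cdf(n, x, alpha):
    """P(max_{t<=n} |Y_t| <= x * a_n) for i.i.d. Pareto |Y_t| with a_n = n**(1/alpha)."""
    threshold = x * n ** (1.0 / alpha)
    tail = threshold ** (-alpha) if threshold >= 1.0 else 1.0
    return (1.0 - tail) ** n

def frechet_cdf(x, alpha):
    """Frechet limit Phi_alpha(x) = exp(-x**(-alpha)), x > 0."""
    return math.exp(-x ** (-alpha))

alpha, x = 1.5, 1.3
for n in (10, 10**3, 10**6):
    print(n, scaled_max_cdf(n, x, alpha), frechet_cdf(x, alpha))
```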




In this setting, the heaviness of the tails of the distribution of $\mathbf{G}_1 Y_1$ is essentially
determined by the distribution of $Y_1$; see Remark 3.4 below.


3.2. Main result. We are now ready to formulate our main result on the as- 
ymptotic behavior of the sum process (S,). 


THEOREM 3.2. Consider the martingale transform

\[
(S_n)_{n \ge 0} = \Big(\sum_{t=1}^{n} \mathbf{Y}_t\Big)_{n \ge 0} = \Big(\sum_{t=1}^{n} \mathbf{G}_t Y_t\Big)_{n \ge 0}
\]

defined above. Assume that the conditions A.1–A.3 are satisfied. Moreover, if $\alpha \in
(1, 2)$, assume that $EY_1 = 0$ and, if $\alpha = 1$, that $Y_1$ is symmetric. Then the
finite-dimensional distributions of $(\mathbf{Y}_t)$ are regularly varying with index $\alpha$ and the limit $\gamma$
in (2.10) exists. If $\gamma > 0$, then

\[
a_n^{-1} S_n \xrightarrow{d} X_\alpha,
\tag{3.2}
\]

where the sequence $(a_n)$ is given by

\[
P(|Y_1| > a_n) \sim n^{-1}
\]

and $X_\alpha$ is an $\alpha$-stable random vector.


REMARK 3.3. In the case when $E|\mathbf{G}_1|^{2+\delta} + E|Y_1|^{2+\delta} < \infty$ for some $\delta > 0$ and $EY_1 = 0$,
(3.2) turns into $n^{-1/2} S_n \xrightarrow{d} X$, where $X$ is Gaussian with mean zero and the same
covariance structure as $\mathbf{G}_1 Y_1$. This follows since $(\mathbf{G}_t Y_t)$ is a strictly stationary
martingale difference sequence; see [4].


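REMARK 3.4. It is not difficult to see that $\mathbf{Y}_1$ is regularly varying with
index $\alpha$. For the proof we need a result of Breiman [11]. It says that if one has two
independent random variables $\xi, \eta > 0$ a.s., $\xi$ is regularly varying with index $\alpha > 0$
and $E\eta^{\nu} < \infty$ for some $\nu > \alpha$, then

\[
P(\xi\eta > x) \sim E\eta^{\alpha}\, P(\xi > x),
\]

that is, $\xi\eta$ is regularly varying with the same index $\alpha$. Now observe that, for
$t, x > 0$ and a Borel set $S \subset S^{d-1}$, by multiple application of Breiman's result,

\[
\begin{aligned}
\frac{P\big(|\mathbf{G}_1||Y_1| > tx,\ \mathbf{G}_1 Y_1/(|\mathbf{G}_1||Y_1|) \in S\big)}{P(|\mathbf{G}_1||Y_1| > x)}
&= \frac{P\big(|\mathbf{G}_1||Y_1| > tx,\ \mathrm{sign}(Y_1)\tilde{\mathbf{G}}_1 \in S\big)}{P(|\mathbf{G}_1||Y_1| > x)} \\
&= \frac{P\big(|\mathbf{G}_1| Y_1 > tx,\ \tilde{\mathbf{G}}_1 \in S\big)}{P(|\mathbf{G}_1||Y_1| > x)}
 + \frac{P\big(|\mathbf{G}_1| Y_1 < -tx,\ -\tilde{\mathbf{G}}_1 \in S\big)}{P(|\mathbf{G}_1||Y_1| > x)} \\
&\sim \frac{E\big(|\mathbf{G}_1|^{\alpha} I_S(\tilde{\mathbf{G}}_1)\big)}{E|\mathbf{G}_1|^{\alpha}} \frac{P(Y_1 > tx)}{P(|Y_1| > x)}
 + \frac{E\big(|\mathbf{G}_1|^{\alpha} I_S(-\tilde{\mathbf{G}}_1)\big)}{E|\mathbf{G}_1|^{\alpha}} \frac{P(Y_1 < -tx)}{P(|Y_1| > x)},
\end{aligned}
\]

where $\tilde{\mathbf{G}}_1 = \mathbf{G}_1/|\mathbf{G}_1|$. Writing for some $p, q \ge 0$ with $p + q = 1$ and a slowly varying function $L(x)$,

\[
P(Y_1 > x) \sim pL(x)x^{-\alpha} \quad \text{and} \quad P(Y_1 \le -x) \sim qL(x)x^{-\alpha}, \qquad x > 0,
\]

we can read off the spectral measure of the vector $\mathbf{Y}_1$:

\[
P(\Theta \in S) = p\,\frac{E\big(|\mathbf{G}_1|^{\alpha} I_S(\tilde{\mathbf{G}}_1)\big)}{E|\mathbf{G}_1|^{\alpha}} + q\,\frac{E\big(|\mathbf{G}_1|^{\alpha} I_S(-\tilde{\mathbf{G}}_1)\big)}{E|\mathbf{G}_1|^{\alpha}}.
\tag{3.3}
\]

By regular variation, $a_n = n^{1/\alpha}\ell(n)$ for some slowly varying function $\ell$. By
Breiman's result and since $E|\mathbf{G}_1|^{\alpha+\epsilon} < \infty$ for some $\epsilon > 0$, it also follows that

\[
P(|\mathbf{G}_1||Y_1| > x) \sim E|\mathbf{G}_1|^{\alpha}\, P(|Y_1| > x),
\]

and, therefore, $P(|\mathbf{Y}_1| > c a_n) \sim n^{-1}$ for some constant $c > 0$. Moreover, we have

\[
nP(a_n^{-1} \mathbf{Y}_1 \in \cdot) \xrightarrow{v} \mu_1,
\tag{3.4}
\]

for some measure $\mu_1$ on $\overline{\mathbb{R}}^d \setminus \{0\}$ which is determined by $\alpha$ and the spectral
measure.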
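
Breiman's result can be checked exactly in a toy case. The sketch below is an illustration under assumptions that are not part of the paper: $\xi$ has the pure Pareto tail $P(\xi > x) = x^{-\alpha}$, $x \ge 1$, and $\eta$ takes finitely many positive values, so that $P(\xi\eta > x)$ is a finite sum. The function names and the particular values of $\eta$ are hypothetical.

```python
def pareto_tail(x, alpha):
    """P(xi > x) for a Pareto variable with tail x**(-alpha) on x >= 1."""
    return x ** (-alpha) if x >= 1.0 else 1.0

def product_tail(x, alpha, eta_values, eta_probs):
    """P(xi * eta > x) for Pareto xi independent of a discrete eta > 0."""
    return sum(p * pareto_tail(x / e, alpha) for e, p in zip(eta_values, eta_probs))

def breiman_ratio(x, alpha, eta_values, eta_probs):
    """P(xi * eta > x) / (E[eta**alpha] * P(xi > x)); Breiman's result says it tends to 1."""
    moment = sum(p * e ** alpha for e, p in zip(eta_values, eta_probs))
    return product_tail(x, alpha, eta_values, eta_probs) / (moment * pareto_tail(x, alpha))

eta_values, eta_probs = [0.5, 2.0, 5.0], [0.2, 0.5, 0.3]
for x in (1.5, 4.0, 100.0):
    print(x, breiman_ratio(x, 1.5, eta_values, eta_probs))
```

In this toy case the ratio equals 1 as soon as $x$ exceeds the largest value of $\eta$; for general $\xi$ and $\eta$ the equivalence is only asymptotic.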


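REMARK 3.5. It follows from the proof below that

\[
\begin{aligned}
nP\big(a_n^{-1}(\mathbf{Y}_1, \ldots, \mathbf{Y}_h) \in d(x_1, \ldots, x_h)\big)
&\xrightarrow{v} \mu_1(dx_1)\,\epsilon_0(d(x_2, \ldots, x_h)) + \cdots + \mu_1(dx_h)\,\epsilon_0(d(x_1, \ldots, x_{h-1})) \\
&=: \mu_h(d(x_1, \ldots, x_h)),
\end{aligned}
\tag{3.5}
\]

where $\mu_1$ is defined by (3.4), $\epsilon_0$ is the Dirac measure at $0$ and

\[
(\mathbf{Y}_1, \ldots, \mathbf{Y}_h) := \mathrm{vec}(\mathbf{Y}_1, \ldots, \mathbf{Y}_h) \quad \text{and} \quad (x_1, \ldots, x_h) := \mathrm{vec}(x_1, \ldots, x_h).
\tag{3.6}
\]

This means, in particular, that the limiting measure in the definition of regular
variation for $(\mathbf{Y}_1, \ldots, \mathbf{Y}_h)$ is the same as in the definition of regular variation for
$\mathrm{vec}(\mathbf{Y}_1', \ldots, \mathbf{Y}_h')$, where $\mathbf{Y}_i'$ are i.i.d. copies of $\mathbf{Y}_1$. This part of the theorem is valid
for any $\alpha > 0$.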


PROOF OF THEOREM 3.2. We verify the conditions of Theorem 2.2. Since
A.3 implies $\mathcal{A}(a_n)$ and since we require $\gamma > 0$, it remains to check (a) and (c) in
Theorem 2.2.


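(a) Regular variation of the finite-dimensional distributions. We show regular
variation of the vector $(\mathbf{Y}_1, \ldots, \mathbf{Y}_h)$ defined in (3.6), that is, we show that (3.5)
holds.

We restrict ourselves to the proof of regular variation of the pairs $(\mathbf{Y}_1, \mathbf{Y}_2) :=
\mathrm{vec}(\mathbf{Y}_1, \mathbf{Y}_2)$; the case of general finite-dimensional distributions is completely
analogous. The regular variation of $\mathbf{Y}_1$ was explained in Remark 3.4. Let now
$B_1$ and $B_2$ be two Borel sets in $\overline{\mathbb{R}}^d \setminus \{0\}$, bounded away from zero. In
particular, there exists $M > 0$ such that $|x| > M$ for all $x \in B_1$ and $x \in B_2$. Then for any
$\epsilon > 0$, by intersecting with the events $\{|\mathbf{G}_i| \le \epsilon\}$ and $\{|\mathbf{G}_i| > \epsilon\}$, $i = 1, 2$,

\[
\begin{aligned}
\{a_n^{-1}\mathbf{Y}_1 \in B_1,\ a_n^{-1}\mathbf{Y}_2 \in B_2\}
&\subset \{|\mathbf{G}_1||Y_1| > Ma_n,\ |\mathbf{G}_2||Y_2| > Ma_n\} \\
&\subset \{\epsilon|Y_1| > Ma_n,\ \epsilon|Y_2| > Ma_n\} \\
&\quad \cup \{|\mathbf{G}_1| I_{(\epsilon,\infty)}(|\mathbf{G}_1|)|Y_1| > Ma_n,\ \epsilon|Y_2| > Ma_n\} \\
&\quad \cup \{|\mathbf{G}_2| I_{(\epsilon,\infty)}(|\mathbf{G}_2|)|Y_2| > Ma_n,\ \epsilon|Y_1| > Ma_n\} \\
&\quad \cup \{|\mathbf{G}_1| I_{(\epsilon,\infty)}(|\mathbf{G}_1|)|Y_1| > Ma_n,\ |\mathbf{G}_2| I_{(\epsilon,\infty)}(|\mathbf{G}_2|)|Y_2| > Ma_n\} \\
&=: D_1 \cup D_2 \cup D_3 \cup D_4.
\end{aligned}
\]

By independence and an application of Breiman's result, $nP(D_1) \to 0$ and
$nP(D_2) \to 0$. Similarly,

\[
nP(D_3) \le nP\big(|\mathbf{G}_2| I_{(\epsilon,\infty)}(|\mathbf{G}_2|)|Y_2| > Ma_n\big)
\sim nP(|Y_2| > Ma_n)\, E\big(|\mathbf{G}_2|^{\alpha} I_{(\epsilon,\infty)}(|\mathbf{G}_2|)\big),
\]

and thus, by Lebesgue's dominated convergence theorem,

\[
\lim_{\epsilon \to \infty} \limsup_{n \to \infty} nP(D_3) = 0,
\]

and $nP(D_4) \to 0$ can be proved in the same way. We conclude that

\[
nP\big(a_n^{-1}(\mathbf{Y}_1, \mathbf{Y}_2) \in d(x_1, x_2)\big) \xrightarrow{v} \mu_1(dx_1)\,\epsilon_0(dx_2) + \mu_1(dx_2)\,\epsilon_0(dx_1)
= \mu_2(d(x_1, x_2));
\]

see [23]. This proves the regular variation of the two-dimensional finite-dimensional
distributions. The higher-dimensional case is completely analogous.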


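(c) The condition (2.9). We have for any $y > 0$,

\[
\begin{aligned}
&P\Big(\max_{k \le t \le r_n} |\mathbf{G}_t||Y_t| > ya_n \,\Big|\, |\mathbf{G}_0||Y_0| > ya_n\Big) \\
&\qquad \le P\Big(\max_{k \le t \le r_n} |\mathbf{G}_t| > ya_n/(s_k a_{r_n}) \,\Big|\, |\mathbf{G}_0||Y_0| > ya_n\Big)
 + P\Big(\max_{k \le t \le r_n} |Y_t| > s_k a_{r_n}\Big) \\
&\qquad =: I_1 + I_2,
\end{aligned}
\]

where $(s_k)$ is any sequence such that $s_k \to \infty$. In what follows all calculations go
through for any $y > 0$; for ease of notation, we set $y = 1$. Then, by Remark 3.1,

\[
\lim_{k \to \infty} \lim_{n \to \infty} I_2 = \lim_{k \to \infty} \big(1 - \Phi_\alpha(s_k)\big) = 0.
\]

An application of Markov's inequality yields, for some constant $c > 0$ and $\epsilon > 0$
as in A.2 (here and in what follows, $c$ denotes any positive constant whose value
is not of interest),

\[
\begin{aligned}
I_1 &\le \sum_{t=k}^{r_n} P\big(|\mathbf{G}_t| > a_n/(s_k a_{r_n}) \,\big|\, |\mathbf{G}_0||Y_0| > a_n\big) \\
&\le \sum_{t=k}^{r_n} \frac{P\big(|\mathbf{G}_t| > a_n/(s_k a_{r_n})\big)}{P(|\mathbf{G}_0||Y_0| > a_n)} \\
&\le c\, n r_n \Big(\frac{s_k a_{r_n}}{a_n}\Big)^{\alpha+\epsilon} E|\mathbf{G}_0|^{\alpha+\epsilon} \\
&\to 0 \qquad \text{as } n \to \infty.
\end{aligned}
\]

Here we used Breiman's result [11] to show that

\[
P(|\mathbf{G}_0||Y_0| > a_n) \sim E|\mathbf{G}_0|^{\alpha}\, P(|Y_0| > a_n),
\]

condition (3.1) and the fact that $E|\mathbf{G}_1|^{\alpha+\epsilon} < \infty$; see A.2.
Now we turn to

\[
\begin{aligned}
&P\Big(\max_{-r_n \le t \le -k} |\mathbf{G}_t||Y_t| > a_n \,\Big|\, |\mathbf{G}_0||Y_0| > a_n\Big) \\
&\qquad \le P\Big(\max_{-r_n \le t \le -k} |\mathbf{G}_t| > a_n/(s_k a_{r_n}) \,\Big|\, |\mathbf{G}_0||Y_0| > a_n\Big)
 + P\Big(\max_{-r_n \le t \le -k} |Y_t| > s_k a_{r_n} \,\Big|\, |\mathbf{G}_0||Y_0| > a_n\Big) \\
&\qquad =: I_3 + I_4.
\end{aligned}
\]

The quantity $I_3$ can be treated in the same way as $I_1$ to show that $I_3 \to 0$ as
$n \to \infty$. We turn to $I_4$. Fix $0 < M < \infty$. Then

\[
\begin{aligned}
I_4 &\le \frac{P\big(\max_{-r_n \le t \le -k} |Y_t| > s_k a_{r_n},\ M|Y_0| > a_n\big)}{P(|\mathbf{G}_0||Y_0| > a_n)}
 + \frac{P\big(|\mathbf{G}_0| I_{(M,\infty)}(|\mathbf{G}_0|)|Y_0| > a_n\big)}{P(|\mathbf{G}_0||Y_0| > a_n)} \\
&=: I_{41} + I_{42}.
\end{aligned}
\]

By independence of the $Y_t$'s, Breiman's [11] result and since $r_n \to \infty$,

\[
\begin{aligned}
I_{41} &\sim \frac{P\big(\max_{-r_n \le t \le -k} |Y_t| > s_k a_{r_n}\big)\, M^{\alpha} P(|Y_0| > a_n)}{E|\mathbf{G}_0|^{\alpha}\, P(|Y_0| > a_n)} \\
&\sim c\big(1 - \Phi_\alpha(s_k)\big) \qquad \text{as } n \to \infty \\
&\to 0 \qquad \text{as } k \to \infty.
\end{aligned}
\]

By virtue of Breiman's [11] result,

\[
I_{42} \sim \frac{E\big(|\mathbf{G}_0|^{\alpha} I_{(M,\infty)}(|\mathbf{G}_0|)\big)\, P(|Y_0| > a_n)}{E|\mathbf{G}_0|^{\alpha}\, P(|Y_0| > a_n)}.
\]

Since $|\mathbf{G}_0|$ has finite moments of order greater than $\alpha$, an application of the
Lebesgue dominated convergence theorem yields

\[
\lim_{M \to \infty} \lim_{n \to \infty} I_{42} = 0.
\]

This proves (2.9).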


Thus, the conditions (a)–(c) and $\gamma > 0$ of Theorem 2.2 are satisfied. In the
case $\alpha < 1$, Theorem 2.2 immediately yields (3.2). In the case $\alpha \in [1, 2)$, we
have to check condition (2.11). It suffices to show it for the components $S_n^{(i)}(0, y]$,
$i = 1, \ldots, d$, of $S_n(0, y]$. Since the components can be handled in the same way,
we suppress the dependence on $i$ and, for ease of notation, write $G_t Y_t$ for the
summands of the $i$th component.

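We start with the case $\alpha \in (1, 2)$. As before, write $\mathcal{F}_t = \sigma(Y_{t-1}, Y_{t-2}, \ldots)$.
Then, for $z > 0$, since $EY_1 = 0$,

\[
\begin{aligned}
E\big[G_t Y_t I_{(0,z]}(|G_t Y_t|/a_n) \,\big|\, \mathcal{F}_t\big]
&= G_t E\big[Y_t I_{(0,z]}(|G_t Y_t|/a_n) \,\big|\, G_t\big] \\
&= -G_t E\big[Y_t I_{(z,\infty)}(|G_t Y_t|/a_n) \,\big|\, G_t\big].
\end{aligned}
\]

Consider the decomposition

\[
\begin{aligned}
&a_n^{-1} \sum_{t=1}^{n} \Big[G_t Y_t I_{(0,z]}(|G_t Y_t|/a_n) - E\big[G_1 Y_1 I_{(0,z]}(|G_1 Y_1|/a_n)\big]\Big] \\
&\qquad = a_n^{-1} \sum_{t=1}^{n} \Big[G_t Y_t I_{(0,z]}(|G_t Y_t|/a_n) - G_t E\big[Y_t I_{(0,z]}(|G_t Y_t|/a_n) \,\big|\, G_t\big]\Big] \\
&\qquad\quad - a_n^{-1} \sum_{t=1}^{n} \Big[G_t E\big[Y_t I_{(z,\infty)}(|G_t Y_t|/a_n) \,\big|\, G_t\big] - E\big[G_1 Y_1 I_{(z,\infty)}(|G_1 Y_1|/a_n)\big]\Big] \\
&\qquad =: T_1 - T_2.
\end{aligned}
\]

For fixed $n$, $T_1$ is a sum of stationary mean zero martingale differences. An
application of Karamata's theorem ([5], page 26) to the regularly varying random
variable $G_1 Y_1$ with index $\alpha$ yields for some constant $c > 0$,

\[
\begin{aligned}
\mathrm{var}(T_1) &= n a_n^{-2} E\Big[G_1 Y_1 I_{(0,z]}(|G_1 Y_1|/a_n) - G_1 E\big[Y_1 I_{(0,z]}(|G_1 Y_1|/a_n) \,\big|\, G_1\big]\Big]^2 \\
&\le c\, n a_n^{-2} E\big[G_1 Y_1 I_{(0,z]}(|G_1 Y_1|/a_n)\big]^2 \\
&\sim c z^{2-\alpha} \qquad \text{as } n \to \infty \\
&\to 0 \qquad \text{as } z \downarrow 0.
\end{aligned}
\tag{3.7}
\]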


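Next we treat $T_2$. Fix $0 < \delta < M < \infty$ to be chosen later. Notice that, by
Karamata's theorem and the uniform convergence theorem for regularly varying
functions, uniformly for $c \in [\delta, M]$,

\[
\frac{E\big[Y_1 I_{(cx,\infty)}(|Y_1|)\big]}{cx\, P(|Y_1| > cx)} \to C \qquad \text{as } x \to \infty
\]

for some constant $C$. Taking this into account, the strong law of large numbers
yields, with probability 1,

\[
\begin{aligned}
&a_n^{-1} \sum_{t=1}^{n} G_t I_{[\delta,M]}(|G_t|)\, E\big[Y_t I_{(z,\infty)}(|G_t Y_t|/a_n) \,\big|\, G_t\big] \\
&\qquad = a_n^{-1} \sum_{t=1}^{n} G_t I_{[\delta,M]}(|G_t|) \times \big[(za_n/|G_t|)\, P(|Y_t| > za_n/|G_t| \mid G_t)\,(C + o(1))\big] \\
&\qquad = (C + o(1))\, z^{1-\alpha}\, n^{-1} \sum_{t=1}^{n} |G_t|^{\alpha} I_{[\delta,M]}(|G_t|) \\
&\qquad \to C z^{1-\alpha} E\big[|G_1|^{\alpha} I_{[\delta,M]}(|G_1|)\big].
\end{aligned}
\tag{3.8}
\]

On the other hand, since $G_1 I_{[\delta,M]}(|G_1|) Y_1$ is regularly varying with index $\alpha \in
(1, 2)$, by the same argument and Breiman's result,

\[
\begin{aligned}
&n a_n^{-1} E\big[G_1 I_{[\delta,M]}(|G_1|)\, Y_1 I_{(z,\infty)}(|G_1 Y_1|/a_n)\big] \\
&\qquad = n a_n^{-1} \big[(C + o(1))(za_n)\, P\big(|G_1| I_{[\delta,M]}(|G_1|)|Y_1| > za_n\big)\big] \\
&\qquad = (C + o(1))\, z^{1-\alpha} E\big[|G_1|^{\alpha} I_{[\delta,M]}(|G_1|)\big].
\end{aligned}
\tag{3.9}
\]

This shows that (3.8) and (3.9) cancel asymptotically as $n \to \infty$ for every fixed $z$.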




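A similar argument shows that, with probability 1,

\[
\begin{aligned}
&a_n^{-1} \Big|\sum_{t=1}^{n} G_t I_{[0,\delta)}(|G_t|)\, E\big[Y_t I_{(z,\infty)}(|G_t Y_t|/a_n) \,\big|\, G_t\big]\Big| \\
&\qquad \le a_n^{-1} \sum_{t=1}^{n} |G_t| I_{[0,\delta)}(|G_t|)\, E\big[|Y_1| I_{(z,\infty)}(\delta|Y_1|/a_n)\big] \\
&\qquad \to c\,(z/\delta)^{1-\alpha} E\big[|G_1| I_{[0,\delta)}(|G_1|)\big].
\end{aligned}
\tag{3.10}
\]

Moreover,

\[
\begin{aligned}
&n a_n^{-1} \big|E\big[G_1 I_{[0,\delta)}(|G_1|)\, Y_1 I_{(z,\infty)}(|G_1 Y_1|/a_n)\big]\big| \\
&\qquad \le n a_n^{-1} E\big[|G_1| I_{[0,\delta)}(|G_1|)\, |Y_1| I_{(z,\infty)}(\delta|Y_1|/a_n)\big] \\
&\qquad \sim c\,(z/\delta)^{1-\alpha} E\big[|G_1| I_{[0,\delta)}(|G_1|)\big].
\end{aligned}
\tag{3.11}
\]

Now choose $\delta = z^2$. Then, first letting $n \to \infty$ and then $z \downarrow 0$, both (3.10) and
(3.11) vanish asymptotically.
Finally, we consider

\[
a_n^{-1} E\Big|\sum_{t=1}^{n} G_t I_{(M,\infty)}(|G_t|)\, E\big[Y_t I_{(z,\infty)}(|G_t Y_t|/a_n) \,\big|\, G_t\big]\Big|
\le a_n^{-1} n\, E\big[|G_1| I_{(M,\infty)}(|G_1|)\, |Y_1| I_{(z,\infty)}(|G_1 Y_1|/a_n)\big].
\]

An application of Breiman's result to the regularly varying random variable
$G_1 I_{(M,\infty)}(|G_1|) Y_1$ gives that the right-hand side is asymptotically equivalent as
$n \to \infty$ to

\[
c z^{1-\alpha} E\big[|G_1|^{\alpha} I_{(M,\infty)}(|G_1|)\big].
\]

Choosing $M$ large enough, the right-hand side is smaller than $z$, say. The same
argument can be applied to

\[
n a_n^{-1} \big|E\big[G_1 I_{(M,\infty)}(|G_1|)\, Y_1 I_{(z,\infty)}(|G_1 Y_1|/a_n)\big]\big|.
\]

Collecting the bounds above, we see that

\[
\lim_{z \downarrow 0} \limsup_{n \to \infty} P(|T_2| > r) = 0, \qquad r > 0.
\]

This together with (3.7) concludes the proof of (2.11) for $\alpha \in (1, 2)$.

For $\alpha = 1$, we use the additional condition of symmetry of $Y_1$. Then
$ES_n(0, y] = 0$ and the same argument as for $\mathrm{var}(T_1)$ above shows that (2.11) holds
in this case as well. This concludes the proof of (2.11).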

Since the conditions of Theorem 2.2 are satisfied for $\alpha \in [1, 2)$, we conclude
that

\[
a_n^{-1}\big(S_n - ES_n(0, 1]\big) \xrightarrow{d} X_\alpha
\]

for some $\alpha$-stable random vector in $\mathbb{R}^d$. For $\alpha = 1$, we can drop $ES_n(0, y]$ because
of the symmetry of $G_t Y_t$. For $\alpha \in (1, 2)$, $G_t Y_t$ is regularly varying with index $\alpha$.
Since $E(G_t Y_t) = 0$, Karamata's theorem yields

\[
a_n^{-1} ES_n(0, 1] \to b
\]

for some constant $b$ which can be incorporated in the stable limit, and, therefore,
centering in (3.2) can be avoided. This concludes the proof of Theorem 3.2.


4. Gaussian quasi maximum likelihood estimation for GARCH processes
with heavy-tailed innovations. In this section we apply Theorem 3.2 to
Gaussian quasi maximum likelihood estimation (QMLE) in GARCH processes.
The limit properties of the QMLE were studied by Berkes et al. [3]. They proved
strong consistency of the QMLE under the moment condition $E|Z_1|^{2+\delta} < \infty$ for
some $\delta > 0$ and established asymptotic normality under $EZ_1^4 < \infty$. Here $(Z_t)$ is an
i.i.d. innovation sequence; see Section 4.1 below for the definition of the GARCH
model and the QMLE. Hall and Yao [16] refined these results and also allowed
for innovation sequences, where $Z_1^2$ is regularly varying with index $\alpha \in (1, 2)$.
Then the speed of convergence is slower than the usual $\sqrt{n}$ rate and the limiting
distribution of the QMLE is (multivariate) $\alpha$-stable.

It is our objective to show that the asymptotic theories for the QMLE under 
light- and heavy-tailed innovations parallel each other and that very similar tech- 
niques can be applied in both cases. However, in the light-tailed case (see [3]) an 
application of the CLT for stationary ergodic martingale differences is the basic 
tool which establishes the asymptotic normality of the QMLE. In the heavy-tailed 
situation one depends on an analog of the CLT which is provided by Theorem 3.2. 

As a matter of fact, the structure of the proofs shows that the asymptotic prop- 
erties of the QMLE are not dependent on the particular structure of the GARCH 
process if one can establish the regular variation of the finite-dimensional distributions
of the underlying process $(\mathbf{Y}_t)$ and the mixing condition $\mathcal{A}(a_n)$. Therefore,
the results of this section have the potential to be extended to more general models, 
including, for example, the AGARCH or EGARCH models whose QMLE proper- 
ties in the light-tailed case are treated in [28]. The most intricate step in the proof 
is, however, the verification of this mixing condition for a given time series model. 
We establish this condition for a GARCH process by an adaptation of Theorem 4.3 
in [21]; this yields strong mixing with geometric rate of the relevant sequence. We 
devote Section 4.4 to the solution of this problem. 

Before we start, we introduce some notation. If $K \subset \mathbb{R}^d$ is a compact set, we
write $C(K, \mathbb{R}^{d'})$ for the space of continuous $\mathbb{R}^{d'}$-valued functions equipped with
the sup-norm $\|v\|_K = \sup_{s \in K} |v(s)|$. The space $C(K, \mathbb{R}^{d_1 \times d_2})$ consists of the
continuous $d_1 \times d_2$-matrix valued functions on $K$; in $\mathbb{R}^{d_1 \times d_2}$ we work with the operator
norm induced by the Euclidean norm $|\cdot|$, that is,

\[
\|A\| = \sup_{|x| = 1} |Ax|, \qquad A \in \mathbb{R}^{d_1 \times d_2}.
\]




4.1. Definition of the QMLE. Recall the definition of a GARCH(p, q) process
$(X_t)$ from (1.1). As before, $(Z_t)$ is an i.i.d. innovation sequence with $EZ_1^2 = 1$
and $EZ_1 = 0$, and $\alpha_i$, $\beta_j$ are nonnegative constants. GARCH processes have been
intensively investigated over the last few years. Assumptions for strict stationarity
are complicated: they are expressed in terms of Lyapunov exponents of certain
random matrices; see [6] for details. A necessary condition for stationarity is

\[
\beta_1 + \cdots + \beta_q < 1
\tag{4.1}
\]

(Corollary 2.3 in [6]). We will make use of this condition later.

In what follows we always assume strict stationarity of the GARCH processes.
As a matter of fact, the observation $X_t$ is always a measurable function of the
past and present innovations $(Z_t, Z_{t-1}, Z_{t-2}, \ldots)$; hence, $(X_t)$ is automatically
ergodic.

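In what follows we review how an approximation to the conditional Gaussian
likelihood of a stationary GARCH(p, q) process is constructed, that is, a
conditional likelihood under the synthetic assumption $Z_t$ i.i.d. $\sim N(0, 1)$. Given
$X_0, \ldots, X_{-p+1}$ and $\sigma_0^2, \ldots, \sigma_{-q+1}^2$, the random variables $X_1, \ldots, X_n$ are
conditionally Gaussian with mean zero and variances $h_t(\theta)$, $t = 1, \ldots, n$, where
$\theta = (\alpha_0, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q)^T$ denotes the presumed parameter and

\[
h_t(\theta) =
\begin{cases}
\sigma_t^2, & t \le 0, \\
\alpha_0 + \alpha_1 X_{t-1}^2 + \cdots + \alpha_p X_{t-p}^2 + \beta_1 h_{t-1}(\theta) + \cdots + \beta_q h_{t-q}(\theta), & t > 0.
\end{cases}
\]

The conditional Gaussian log-likelihood has the form

\[
\log f_\theta\big(X_1, \ldots, X_n \mid X_0, \ldots, X_{-p+1}, \sigma_0^2, \ldots, \sigma_{-q+1}^2\big)
= -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{t=1}^{n} \Big(\frac{X_t^2}{h_t(\theta)} + \log h_t(\theta)\Big).
\tag{4.2}
\]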


Since $X_0, \ldots, X_{-p+1}$ are not available and the squared volatilities $\sigma_0^2, \ldots, \sigma_{-q+1}^2$ are
unobservable, the conditional Gaussian log-likelihood (4.2) cannot be numerically
evaluated without a certain initialization for $\sigma_0^2, \ldots, \sigma_{-q+1}^2$ and $X_0, \ldots, X_{-p+1}$.
The initial values being asymptotically irrelevant, we set the $X_t$'s equal to zero
and $\hat{h}_t(\theta) = \alpha_0/(1 - \beta_1 - \cdots - \beta_q)$ for $t \le 0$. We arrive at

\[
\hat{h}_t(\theta) =
\begin{cases}
\alpha_0/(1 - \beta_1 - \cdots - \beta_q), & t \le 0, \\
\alpha_0 + \alpha_1 X_{t-1}^2 + \cdots + \alpha_{\min(p,t-1)} X_{t-\min(p,t-1)}^2 + \beta_1 \hat{h}_{t-1}(\theta) + \cdots + \beta_q \hat{h}_{t-q}(\theta), & t > 0.
\end{cases}
\tag{4.3}
\]

The function $(\hat{h}_t(\theta))^{1/2}$ can be understood as an estimate of the volatility at time $t$
under parameter hypothesis $\theta$. It can be established that $\|\hat{h}_t - h_t\|_K \xrightarrow{\text{a.s.}} 0$ with
a geometric rate of convergence and uniformly on the compact set $K$ defined
in (4.4) below. This suggests that, by replacing $h_t(\theta)$ by $\hat{h}_t(\theta)$ in (4.2), we
obtain a good approximation to the conditional Gaussian log-likelihood. Since the
constant $-n \log(2\pi)/2$ does not matter for the optimization, we define the QMLE
$\hat{\theta}_n$ as a maximizer of the function

\[
\hat{L}_n(\theta) = \sum_{t=1}^{n} \hat{\ell}_t(\theta) = -\frac{1}{2} \sum_{t=1}^{n} \Big(\frac{X_t^2}{\hat{h}_t(\theta)} + \log \hat{h}_t(\theta)\Big)
\]

with respect to $\theta \in K$, with $K$ being the compact set

\[
K = \big\{\theta \in \mathbb{R}^{p+q+1} : m \le \alpha_i, \beta_j \le M,\ \beta_1 + \cdots + \beta_q \le \bar{\beta}\big\},
\tag{4.4}
\]

where $0 < m < M < \infty$ and $0 < \bar{\beta} < 1$ are such that $qm < \bar{\beta}$.
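
The recursion (4.3) and the resulting objective function translate directly into code. The following is a minimal sketch, not the authors' implementation: the function names `h_hat` and `quasi_loglik` are hypothetical, observations are passed as a plain list $[X_1, \ldots, X_n]$, and no optimization over $K$ is attempted.

```python
import math

def h_hat(x, alpha0, alphas, betas):
    """Squared-volatility estimates from recursion (4.3) for t = 1, ..., n.

    x is the list [X_1, ..., X_n]; alphas = [alpha_1..alpha_p], betas = [beta_1..beta_q].
    """
    p, q = len(alphas), len(betas)
    h0 = alpha0 / (1.0 - sum(betas))          # initial value used for all t <= 0
    h = []
    for t in range(1, len(x) + 1):
        value = alpha0
        for i in range(1, min(p, t - 1) + 1):
            value += alphas[i - 1] * x[t - i - 1] ** 2    # alpha_i * X_{t-i}^2
        for j in range(1, q + 1):
            prev = h[t - j - 1] if t - j >= 1 else h0     # h_hat_{t-j}
            value += betas[j - 1] * prev
        h.append(value)
    return h

def quasi_loglik(x, alpha0, alphas, betas):
    """Objective maximized by the QMLE: -1/2 * sum(X_t^2 / h_hat_t + log h_hat_t)."""
    h = h_hat(x, alpha0, alphas, betas)
    return -0.5 * sum(xt ** 2 / ht + math.log(ht) for xt, ht in zip(x, h))

# GARCH(1,1) toy input: alpha_0 = 0.1, alpha_1 = 0.2, beta_1 = 0.5
print(h_hat([1.0, -2.0, 0.5], 0.1, [0.2], [0.5]))
print(quasi_loglik([1.0, -2.0, 0.5], 0.1, [0.2], [0.5]))
```

In practice one would hand `quasi_loglik` to a numerical optimizer restricted to the compact set $K$ of (4.4); that step is omitted here.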


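REMARK 4.1. From a comparison with [3], one might think at first sight that
our definition of the QMLE is different from theirs. To see that $\hat{h}_t$ coincides with
$\hat{w}_t$ in [3], introduce the polynomials

\[
\alpha(z) = \alpha_1 z + \cdots + \alpha_p z^p \quad \text{and} \quad \beta(z) = 1 - \beta_1 z - \cdots - \beta_q z^q
\]

for every $\theta = (\alpha_0, \alpha_1, \ldots, \alpha_p, \beta_1, \ldots, \beta_q)^T \in K$. Then one can show by induction
on $t$ that

\[
\hat{h}_t(\theta) = \frac{\alpha_0}{\beta(1)} + \sum_{j=1}^{t-1} \psi_j(\theta) X_{t-j}^2,
\tag{4.5}
\]

where the coefficients $\psi_j(\theta)$ are defined through

\[
\frac{\alpha(z)}{\beta(z)} = \sum_{j=1}^{\infty} \psi_j(\theta) z^j, \qquad |z| \le 1.
\tag{4.6}
\]

Note that the latter Taylor series representation is valid because $\beta_j \ge 0$ and
$\beta_1 + \cdots + \beta_q \le \bar{\beta} < 1$ imply $\beta(z) \ne 0$ on $K$ for $|z| \le 1 + \epsilon$ and $\epsilon > 0$ sufficiently small.
We choose (4.3) rather than (4.5) as a first definition for the squared volatility
estimate under parameter hypothesis $\theta$, because the recursion (4.3) is natural and
computationally attractive. In [3] the starting point for the definition of the QMLE
is their Theorem 2.2, which says that for all $t \in \mathbb{Z}$ one has $h_t(\theta_0) = \sigma_t^2$, where $\theta_0$ is the
true parameter and

\[
h_t(\theta) = \frac{\alpha_0}{\beta(1)} + \sum_{j=1}^{\infty} \psi_j(\theta) X_{t-j}^2.
\tag{4.7}
\]

In [3] this leads to the definition of a squared volatility estimate at time $t$ under
parameter $\theta$ based on $(X_1, \ldots, X_{t-1})$, which is given by (4.5). Note also that $(h_t(\theta))$
obeys

\[
h_t(\theta) = \alpha_0 + \alpha_1 X_{t-1}^2 + \cdots + \alpha_p X_{t-p}^2 + \beta_1 h_{t-1}(\theta) + \cdots + \beta_q h_{t-q}(\theta), \qquad \theta \in K.
\tag{4.8}
\]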
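
The equivalence between the recursion (4.3) and the series form (4.5) can be checked numerically. The sketch below is an illustration under assumptions: it takes the constant term of (4.5) to be $\alpha_0/(1 - \beta_1 - \cdots - \beta_q)$, obtains the $\psi_j$ by matching coefficients in $\alpha(z) = \beta(z) \sum_j \psi_j z^j$, and uses hypothetical function names and arbitrary test data.

```python
def psi_coefficients(alphas, betas, n_terms):
    """Return [psi_1, ..., psi_n_terms] with alpha(z) / beta(z) = sum_j psi_j z^j.

    Matching coefficients gives psi_j = alpha_j + sum_i beta_i * psi_{j-i},
    with alpha_j = 0 for j > p and psi_k = 0 for k <= 0.
    """
    p, q = len(alphas), len(betas)
    psi = []
    for j in range(1, n_terms + 1):
        value = alphas[j - 1] if j <= p else 0.0
        for i in range(1, min(q, j - 1) + 1):
            value += betas[i - 1] * psi[j - i - 1]        # beta_i * psi_{j-i}
        psi.append(value)
    return psi

def h_hat_series(x, alpha0, alphas, betas):
    """Series form (4.5): constant plus sum_{j=1}^{t-1} psi_j X_{t-j}^2, t = 1..n."""
    n = len(x)
    psi = psi_coefficients(alphas, betas, n)
    const = alpha0 / (1.0 - sum(betas))                   # assumed constant term
    return [const + sum(psi[j - 1] * x[t - j - 1] ** 2 for j in range(1, t))
            for t in range(1, n + 1)]

def h_hat_recursive(x, alpha0, alphas, betas):
    """Recursion (4.3) with X_t = 0 and h_hat_t = alpha0 / (1 - sum(betas)) for t <= 0."""
    p, q = len(alphas), len(betas)
    h0 = alpha0 / (1.0 - sum(betas))
    h = []
    for t in range(1, len(x) + 1):
        value = alpha0
        for i in range(1, min(p, t - 1) + 1):
            value += alphas[i - 1] * x[t - i - 1] ** 2
        for j in range(1, q + 1):
            value += betas[j - 1] * (h[t - j - 1] if t - j >= 1 else h0)
        h.append(value)
    return h

x = [0.3, -1.2, 0.7, 2.0, -0.4, 0.1]
print(h_hat_recursive(x, 0.05, [0.1, 0.05], [0.3, 0.2]))
print(h_hat_series(x, 0.05, [0.1, 0.05], [0.3, 0.2]))
```

The recursion costs $O(n(p+q))$ operations against $O(n^2)$ for the series form, which is one way to read the remark that (4.3) is computationally attractive.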




4.2. Limit distribution in the case $EZ_1^4 < \infty$. First we list the conditions
employed by [3] for establishing consistency and asymptotic normality of $\hat{\theta}_n$. Write
$\theta_0 = (\alpha_0^\circ, \alpha_1^\circ, \ldots, \alpha_p^\circ, \beta_1^\circ, \ldots, \beta_q^\circ)^T$ for the true parameter.

C.1. There is a $\delta > 0$ such that $E|Z_1|^{2+\delta} < \infty$.

C.2. The distribution of $|Z_1|$ is not concentrated in one point.

C.3. There is a $\mu > 0$ such that $P(|Z_1| \le t) = o(t^{\mu})$ as $t \downarrow 0$.

C.4. The true parameter $\theta_0$ lies in the interior of $K$.

C.5. The polynomials $\alpha^\circ(z) = \alpha_1^\circ z + \cdots + \alpha_p^\circ z^p$ and $\beta^\circ(z) = 1 - \beta_1^\circ z - \cdots - \beta_q^\circ z^q$
do not have any common roots.


Now we are ready to quote the main result of [3]. We cite it in order to be able to
compare the assumptions and assertions both in the light- and heavy-tailed cases;
see Theorem 4.4 below.


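THEOREM 4.2 (Theorem 4.1 of [3]). Let $(X_t)$ be a stationary GARCH(p, q)
process with true parameter vector $\theta_0$. Suppose the conditions C.1–C.5 hold. Then
the QMLE $\hat{\theta}_n$ is strongly consistent, that is,

\[
\hat{\theta}_n \xrightarrow{\text{a.s.}} \theta_0, \qquad n \to \infty.
\]

If, in addition, $EZ_0^4 < \infty$, then $\hat{\theta}_n$ is also asymptotically normal, that is,

\[
\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} N\big(0, B_0^{-1} A_0 B_0^{-1}\big),
\]

where the $(p + q + 1) \times (p + q + 1)$ matrices $A_0$ and $B_0$ are given by

\[
A_0 = \frac{E(Z_0^4 - 1)}{4}\, E\Big(\frac{1}{\sigma_1^4}\, h_1'(\theta_0)^T h_1'(\theta_0)\Big),
\qquad
B_0 = -\frac{1}{2}\, E\Big(\frac{1}{\sigma_1^4}\, h_1'(\theta_0)^T h_1'(\theta_0)\Big).
\tag{4.9}
\]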

4.3. Limit distribution in the case $EZ_1^4 = \infty$. First we identify the limit
determining term for the QMLE. To this end, we set analogously to [3],

\[
L_n(\theta) = \sum_{t=1}^{n} \ell_t(\theta) = -\frac{1}{2} \sum_{t=1}^{n} \Big(\frac{X_t^2}{h_t(\theta)} + \log h_t(\theta)\Big)
\]

and define $\tilde{\theta}_n$ as a maximizer of $L_n$ with respect to $\theta \in K$. It is a slightly simpler
problem to analyze $\tilde{\theta}_n$ because $(\ell_t)$ is stationary ergodic, in contrast to $(\hat{\ell}_t)$.

As is shown in Proposition 4.3 below, $\hat{\theta}_n$ and $\tilde{\theta}_n$ are asymptotically equivalent. It
turns out that the asymptotic distribution of the QMLE is essentially determined by
the limit behavior of $L_n'(\theta_0)/n$, up to multiplication with the matrix $-B_0^{-1}$. These
results follow by a careful analysis of the proofs in [3]. We omit details and refer to




the website [20] for a detailed proof. Compare also with the similar reference [28], 
where the case of processes with a more general volatility structure than GARCH 
is treated. 


PROPOSITION 4.3. Let $(X_t)$ be a stationary GARCH(p, q) process with true
parameter vector $\theta_0$. Suppose the conditions C.1–C.5 apply. If there is a positive
sequence $(x_n)_{n \ge 1}$ with $x_n = o(n)$ as $n \to \infty$ and

\[
x_n\, \frac{L_n'(\theta_0)}{n} \xrightarrow{d} D, \qquad n \to \infty,
\tag{4.10}
\]

for an $\mathbb{R}^{p+q+1}$-valued random variable $D$, then the QMLE $\hat{\theta}_n$ satisfies the limit
relation

\[
x_n(\hat{\theta}_n - \theta_0) \xrightarrow{d} -B_0^{-1} D,
\tag{4.11}
\]

where $B_0$ is given by (4.9).


Now we can state the main theorem of this section. We note once again that Hall 
and Yao [16] derived the identical result by means of different techniques. 


THEOREM 4.4. Let $(X_t)$ be a stationary GARCH(p, q) process with true
parameter vector $\theta_0$. Suppose that $Z_0^2$ is regularly varying with index $\alpha \in (1, 2)$ and
that C.3–C.5 hold. Moreover, assume that $Z_1$ has a Lebesgue density $f$, where
the closure of the interior of the support $\{f > 0\}$ contains the origin. Define
$(x_n) = (n a_n^{-1})$, where

\[
P(Z_0^2 > a_n) \sim n^{-1}, \qquad n \to \infty.
\]

Then the QMLE $\hat{\theta}_n$ is consistent and

\[
x_n(\hat{\theta}_n - \theta_0) \xrightarrow{d} D_\alpha, \qquad n \to \infty,
\tag{4.12}
\]

for some nondegenerate $\alpha$-stable vector $D_\alpha$.


Before proving the theorem, we discuss its practical consequences for parameter 
inference: 


e The rate of convergence x, has—roughly speaking—magnitude n^ '/*, which 
is less than ./n. The heavier the tails of the innovations, that is, the smaller a, 
the slower is the convergence of 6 n toward the true parameter 89. 

e The limit distribution of the standardized differences (6 n — 00) is a-stable and, 
hence, non-Gaussian. The exact parameters of this o-stable limit are not explic- 
itly known. 

e Ene bands based on the normal approximation of Theorem 4.2 are false 
if EZ, — oo. 


SIABLE LIMITS AND GARCH 513 


• By the definition of a GARCH process, the distribution of the innovations $Z_t$ is
unknown. Therefore, assumptions about the heaviness of the tails of its
distribution are purely hypothetical. As a matter of fact, the tails of the distribution
of $X_t$ can be regularly varying even if $Z_t$ has light tails, such as for the normal
distribution; see [2]. Depending on the assumptions on the distribution of $Z_1$,
one can develop different asymptotic theories for QMLE of GARCH processes:
asymptotic normality as provided by Theorem 4.2 or infinite variance stable
distributions as provided by Theorem 4.4.
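
To make the rate in the first item concrete, the following Python sketch assumes an exact Pareto tail $P(Z_0^2 > x) = x^{-\alpha}$ for $x \ge 1$. This is an illustrative assumption only (the theorem allows general regularly varying tails), under which $a_n = n^{1/\alpha}$ and $x_n = n/a_n = n^{1-1/\alpha}$. The function names are hypothetical.

```python
import math


def a_n(n, alpha):
    """Solve P(Z^2 > a_n) = 1/n under the assumed exact Pareto tail x**(-alpha)."""
    return n ** (1.0 / alpha)


def x_n(n, alpha):
    """Normalizing sequence x_n = n / a_n of Theorem 4.4 (Pareto-tail assumption)."""
    return n / a_n(n, alpha)


if __name__ == "__main__":
    n = 10_000
    for alpha in (1.2, 1.5, 1.9):
        # The smaller alpha is, the further x_n falls below the usual sqrt(n) rate.
        print(f"alpha={alpha}: x_n={x_n(n, alpha):8.2f}  sqrt(n)={math.sqrt(n):.2f}")
```

In this idealized setting the gap to $\sqrt{n}$ closes only as $\alpha$ approaches 2.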


PROOF OF THEOREM 4.4. The proof follows by combining Theorem 3.2 and
Proposition 4.3. Indeed, setting

$G_t = h_t'(\theta_0)/\sigma_t^2, \qquad Y_t = (Z_t^2 - 1)/2 \qquad \text{and} \qquad \mathbf{Y}_t = G_t Y_t,$

one recognizes that

(4.13)  $L_n'(\theta_0) = \dfrac{1}{2} \sum_{t=1}^{n} \dfrac{h_t'(\theta_0)}{\sigma_t^2} (Z_t^2 - 1) = \sum_{t=1}^{n} G_t Y_t$

is a martingale transform. Regular variation of $Z_0^2$ with index $\alpha \in (1, 2)$
implies A.1, but also C.1 and C.2. Condition A.2 is fulfilled because $\|h_t'/h_t\|_K$ has
finite moments of any order (Lemma 5.2 of [3]), and so has $|G_t|$. Condition A.3
holds if we can show that $(\mathbf{Y}_t)$ is strongly mixing with geometric rate, in
which case we choose $r_n = n^{\delta}$ in $A(a_n)$ for any small $\delta > 0$, so that (3.1)
immediately follows. This choice of $(r_n)$ is justified by the arguments given in [2]. The
strong mixing condition with geometric rate of $(\mathbf{Y}_t)$ will be verified in Section 4.4.

Finally, we have to give an argument for $\gamma > 0$. The latter quantity has
interpretation as the extremal index of the sequence $(|\mathbf{Y}_t|)$; see Remark 2.4.
According to Theorem 3.7.2 in [18], if $\gamma = 0$ and for some sequence $(u_n)$ the relation
$\liminf_{n \to \infty} P(\widetilde M_n \le u_n) > 0$ holds, then one necessarily has
$\lim_{n \to \infty} P(M_n \le u_n) = 1$. Here $M_n = \max(|\mathbf{Y}_1|, \ldots, |\mathbf{Y}_n|)$ and $(\widetilde M_n)$ is the corresponding sequence
of partial maxima for an i.i.d. sequence $(R_t)$, where $R_1$ has the same distribution
as $|\mathbf{Y}_1|$.

We show $\gamma > 0$ by contradiction, using the above result; so suppose that $\gamma = 0$. The
random variable $|\mathbf{Y}_1| \stackrel{d}{=} R_1$ is regularly varying with index $\alpha$ since $\mathbf{Y}_1$ is
regularly varying with index $\alpha$. Hence, $(a_n^{-1} \widetilde M_n)$ has a Fréchet limit distribution
$\Phi_\alpha(x) = \exp\{-x^{-\alpha}\}$, $x > 0$; see Remark 3.1.

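On the other hand, we will show that $P(M_n \le x a_n) \to 1$ does not hold for
any positive $x$, thus contradicting the hypothesis $\gamma = 0$. Indeed, straightforward
arguments exploiting

$\dfrac{\partial h_t(\theta)}{\partial \alpha_0} = 1 + \sum_{j=1}^{q} \beta_j \dfrac{\partial h_{t-j}(\theta)}{\partial \alpha_0},$

$\dfrac{\partial h_t(\theta)}{\partial \alpha_i} = X_{t-i}^2 + \sum_{j=1}^{q} \beta_j \dfrac{\partial h_{t-j}(\theta)}{\partial \alpha_i}$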


514 T. MIKOSCH AND D. STRAUMANN 


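for all $i = 1, \ldots, p$, show that

(4.14)  $\dfrac{\partial h_t(\theta_0)}{\partial \alpha_i} \ge 0$  for all $i = 0, \ldots, p$,

and

(4.15)  $\sum_{i=0}^{p} \alpha_i^{\circ} \, \dfrac{\partial h_t(\theta_0)}{\partial \alpha_i} = h_t(\theta_0).$

Since the Euclidean norm is equivalent to the 1-norm $|x|_1 = \sum_i |x_i|$ and
$\alpha_i \le M$ on $K$, there is a $c > 0$ such that

$|G_t| = \dfrac{|h_t'(\theta_0)|}{h_t(\theta_0)} \ge \dfrac{c}{h_t(\theta_0)} \sum_{i=0}^{p} \alpha_i^{\circ} \left| \dfrac{\partial h_t(\theta_0)}{\partial \alpha_i} \right| = \dfrac{c}{h_t(\theta_0)} \sum_{i=0}^{p} \alpha_i^{\circ} \, \dfrac{\partial h_t(\theta_0)}{\partial \alpha_i} = c.$

Note that the last two equalities in the latter display are a consequence of (4.14)
and (4.15). In particular, we proved that $|G_t| \ge c$ for all $t$ and therefore

$P(M_n \le x a_n) = P\Bigl( \max_{i=1,\ldots,n} |G_i| \, |Y_i| \le x a_n \Bigr) \le P\Bigl( \max_{i=1,\ldots,n} |Y_i| \le c^{-1} x a_n \Bigr).$

The same classical limit result for maxima as above ensures that the right-hand side
probability converges to a Fréchet limit and is never equal to 1 for all positive $x$.
Thus, we have proved $\gamma > 0$.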
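Now, all conditions of Theorem 3.2 are verified so that

$2 a_n^{-1} L_n'(\theta_0) = 2 x_n \dfrac{L_n'(\theta_0)}{n} \stackrel{d}{\to} \widetilde D_\alpha,$

where $\widetilde D_\alpha$ is $\alpha$-stable [notice that $P((Z_0^2 - 1)/2 > a_n/2) \sim P(Z_0^2 > a_n) \sim n^{-1}$].
Since $x_n/n = a_n^{-1} \to 0$, Proposition 4.3 implies

$x_n(\hat\theta_n - \theta_0) \stackrel{d}{\to} -2^{-1} B_0^{-1} \widetilde D_\alpha = D_\alpha.$

Recalling that a linear transformation of an $\alpha$-stable random vector is again
$\alpha$-stable (see [27]), we conclude the proof of the theorem. □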


4.4. Verification of strong mixing with geometric rate of $(\mathbf{Y}_t)$. To begin with,
we quote a powerful result due to Mokkadem [21], which allows one to establish
strong mixing in stationary solutions of so-called polynomial linear stochastic
recurrence equations (SREs). A sequence $(\mathbf{Y}_t)$ of random vectors in $\mathbb{R}^d$ obeys a
linear SRE if

(4.16)  $\mathbf{Y}_t = P_t \mathbf{Y}_{t-1} + Q_t,$

where $((P_t, Q_t))$ constitutes an i.i.d. sequence with values in $\mathbb{R}^{d \times d} \times \mathbb{R}^d$. A linear
SRE is called polynomial if there exists an i.i.d. sequence $(e_t)$ in $\mathbb{R}^{d'}$ such that
$P_t = P(e_t)$ and $Q_t = Q(e_t)$, where $P(x)$ and $Q(x)$ have entries and coordinates,
respectively, which are polynomial functions of the coordinates of $x$. The existence
and uniqueness of a stationary solution to (4.16) has been studied by Brandt [10],
Bougerol and Picard [7], Babillot et al. [1] and others. The following set of
conditions is sufficient: $E \log^+ \|P_1\| < \infty$, $E \log^+ |Q_1| < \infty$, and the top Lyapunov
coefficient associated with the operator sequence $(P_t)$ is strictly negative, that is,

(4.17)  $\rho = \inf\{ t^{-1} E \log \|P_t \cdots P_1\| : t \ge 1 \} < 0.$

Here $\|\cdot\|$ is the operator norm corresponding to an arbitrary fixed norm $|\cdot|$ in $\mathbb{R}^d$,
for example, the Euclidean norm. The following result is a slight generalization of
Theorem 4.3 in [21]; see the beginning of the proof below for a comparison.
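
The top Lyapunov coefficient in (4.17) is rarely available in closed form. As a rough illustration, which is not part of the original argument, the following Python sketch estimates it along one simulated path. It relies on the Furstenberg–Kesten theorem, by which $t^{-1} \log \|P_t \cdots P_1\| \to \rho$ almost surely when $E \log^+ \|P_1\| < \infty$. The function names, the GARCH(1,1) coefficients 0.1 and 0.8 and the Gaussian innovations are assumptions chosen for the example.

```python
import math
import random


def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    d = len(a)
    return [[sum(a[i][r] * b[r][j] for r in range(d)) for j in range(d)]
            for i in range(d)]


def frobenius(a):
    """Frobenius norm; any matrix norm yields the same limit."""
    return math.sqrt(sum(x * x for row in a for x in row))


def lyapunov_estimate(draw_matrix, d, t, rng):
    """Return t^{-1} log ||P_t ... P_1|| for one simulated path.

    The running product is renormalized at every step so that it neither
    underflows nor overflows; the logs of the scaling factors are summed.
    """
    prod = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    log_norm = 0.0
    for _ in range(t):
        prod = matmul(draw_matrix(rng), prod)
        scale = frobenius(prod)
        log_norm += math.log(scale)
        prod = [[x / scale for x in row] for row in prod]
    return log_norm / t


def garch11_matrix(rng, alpha1=0.1, beta1=0.8):
    """1 x 1 case P_t = alpha1 * Z_t^2 + beta1 with standard normal Z_t (assumed)."""
    z = rng.gauss(0.0, 1.0)
    return [[alpha1 * z * z + beta1]]


if __name__ == "__main__":
    rng = random.Random(1)
    # A negative value is consistent with condition (4.17) for this toy model.
    print(lyapunov_estimate(garch11_matrix, 1, 20_000, rng))
```

A single path gives only a point estimate; repeating the simulation with different seeds gives a rough idea of its variability.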


THEOREM 4.5. Let $(e_t)$ be an i.i.d. sequence of random vectors in $\mathbb{R}^{d'}$. Then
consider the polynomial linear SRE

(4.18)  $\mathbf{Y}_t = P(e_t) \mathbf{Y}_{t-1} + Q(e_t),$

where $P(e_t)$ is a random $d \times d$ matrix and $Q(e_t)$ a random $\mathbb{R}^d$-valued vector.
Suppose:

1. $P(0)$ has spectral radius strictly smaller than 1 and the top Lyapunov coefficient
$\rho$ corresponding to $(P(e_t))$ is strictly negative.
2. There is an $s > 0$ such that

$E\|P(e_1)\|^s < \infty$ and $E|Q(e_1)|^s < \infty.$

3. There is a smooth algebraic variety $V \subset \mathbb{R}^{d'}$ such that $e_1$ has a density $f$ with
respect to Lebesgue measure on $V$. Assume that 0 is contained in the closure of
the interior of the support $\{f > 0\}$.

Then the polynomial linear SRE (4.18) has a unique stationary ergodic
solution $(\mathbf{Y}_t)$ which is absolutely regular with geometric rate and consequently
strongly mixing with geometric rate.


REMARK 4.6. As regards the definition of a smooth algebraic variety, we first
introduce the notion of an algebraic subset. An algebraic subset of $\mathbb{R}^{d'}$ is a set
of the form

$V = \{ x \in \mathbb{R}^{d'} \mid F_1(x) = \cdots = F_r(x) = 0 \},$

where $F_1, \ldots, F_r$ are real multivariate polynomials. An algebraic variety is an
algebraic subset which is not the union of two proper algebraic subsets. An algebraic
variety is smooth if the Jacobian of $F = (F_1, \ldots, F_r)'$ has identical rank
everywhere on $V$. Examples of smooth algebraic varieties in $\mathbb{R}^{d'}$ are the hyperplanes
of $\mathbb{R}^{d'}$ or $V = \mathbb{R}^{d'}$.




REMARK 4.7. Recall that absolute regularity (or $\beta$-mixing) is a mixing notion
which is slightly more restrictive than strong mixing; it requires

$E\Bigl( \sup_{B \in \sigma(\mathbf{Y}_t,\, t > k)} \bigl| P\bigl(B \mid \sigma(\mathbf{Y}_s, s \le 0)\bigr) - P(B) \bigr| \Bigr) =: b_k \to 0, \qquad k \to \infty.$

Indeed, $\beta$-mixing implies strong mixing with the same rate function; see [14] for
details on mixing.


PROOF OF THEOREM 4.5. If $E\|P(e_1)\|^{\tilde s} < 1$ for some $\tilde s > 0$, we can
immediately apply Theorem 4.3 in [21]. In the general case, we use Mokkadem's
result to prove absolute regularity with geometric rate for some subsequence
$(\widetilde{\mathbf{Y}}_t) = (\mathbf{Y}_{tm})_{t \in \mathbb{Z}}$, some $m \ge 1$, by observing that $(\widetilde{\mathbf{Y}}_t)$ satisfies the linear SRE
(4.19) below. The subsequence argument works because the mixing coefficient $b_k$
is nonincreasing and since $(\mathbf{Y}_t)$ is a Markov process. Then one has the simpler
representation

$b_k = E\Bigl( \sup_{B \in \sigma(\mathbf{Y}_{k+1})} \bigl| P\bigl(B \mid \sigma(\mathbf{Y}_0)\bigr) - P(B) \bigr| \Bigr);$

see, for example, [9].

Since $\rho < 0$, there is an $m \ge 1$ with $E \log \|P(e_m) \cdots P(e_1)\| < 0$. From
the fact that the map $u \mapsto E\|P(e_m) \cdots P(e_1)\|^u$ has first derivative equal to
$E \log \|P(e_m) \cdots P(e_1)\|$ at $u = 0$, we deduce that there is an $0 < \tilde s < s$ with
$E\|P(e_m) \cdots P(e_1)\|^{\tilde s} < 1$. Then note that $(\widetilde{\mathbf{Y}}_t) = (\mathbf{Y}_{tm})$ obeys a linear SRE:

(4.19)  $\widetilde{\mathbf{Y}}_t = \widetilde P(\tilde e_t) \widetilde{\mathbf{Y}}_{t-1} + \widetilde Q(\tilde e_t),$

where

$\tilde e_t = (e_{tm}, e_{tm-1}, \ldots, e_{(t-1)m+1})'$

and

$\widetilde P(\tilde e_t) = P(e_{tm}) \cdots P(e_{(t-1)m+1}),$

$\widetilde Q(\tilde e_t) = Q(e_{tm}) + \sum_{j=1}^{m-1} \Bigl( \prod_{i=1}^{j} P(e_{tm+1-i}) \Bigr) Q(e_{tm-j}).$

Since both the matrix $\widetilde P(\tilde e_t)$ and the vector $\widetilde Q(\tilde e_t)$ are polynomial functions of
the coordinates of $\tilde e_t$ and the sequence $(\tilde e_t)$ is i.i.d., $(\widetilde{\mathbf{Y}}_t)$ obeys a polynomial
linear SRE. Observe that $\widetilde P(0) = (P(0))^m$ has spectral radius strictly smaller than 1,
that $E\|\widetilde P(\tilde e_1)\|^{\tilde s} < 1$ and $E|\widetilde Q(\tilde e_1)|^{\tilde s} < \infty$ and that $\tilde e_1$ has a density with respect
to Lebesgue measure on $V^m$, where $V^m$ is a smooth algebraic variety (see A.14
in [21]). Thus, an application of Theorem 4.3 in [21] yields that $(\widetilde{\mathbf{Y}}_t)$ is absolutely
regular with geometric rate. This proves the assertion. □


The following two facts will also be needed. 


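LEMMA 4.8. Let $(P_t)$ be an i.i.d. sequence of $k \times k$ matrices with $E\|P_1\|^s < \infty$
for some $s > 0$. Then the associated top Lyapunov coefficient $\rho < 0$ if and only
if there exist $c > 0$, $\tilde s > 0$ and $\lambda < 1$ so that

(4.20)  $E\|P_t \cdots P_1\|^{\tilde s} \le c \lambda^t, \qquad t \ge 1.$


PROOF. For the proof of necessity, observe that there exists $n \ge 1$ such that
$E \log \|P_n \cdots P_1\| < 0$. From the fact that the map $u \mapsto E\|P_n \cdots P_1\|^u$ has first
derivative equal to $E \log \|P_n \cdots P_1\|$ at $u = 0$, we deduce that there is an $\tilde s > 0$
with $E\|P_n \cdots P_1\|^{\tilde s} = \tilde\lambda < 1$. Since the operator norm $\|\cdot\|$ is submultiplicative and
the factors in $P_t \cdots P_1$ are i.i.d.,

$E\|P_t \cdots P_1\|^{\tilde s} \le \tilde\lambda^{[t/n]} \max_{\ell=1,\ldots,n-1} E\|P_\ell \cdots P_1\|^{\tilde s} \le c \lambda^t, \qquad t \ge 1,$

for $c = \tilde\lambda^{-1} (\max_{\ell=1,\ldots,n-1} E\|P_\ell \cdots P_1\|^{\tilde s})$ and $\lambda = \tilde\lambda^{1/n}$. Regarding the proof of
sufficiency, use Jensen's inequality and $\lim_{t \to \infty} t^{-1} E \log \|P_t \cdots P_1\| = \rho$ to
conclude

$\rho = \lim_{t \to \infty} \dfrac{1}{t \tilde s} E \log \|P_t \cdots P_1\|^{\tilde s} \le \limsup_{t \to \infty} \dfrac{1}{t \tilde s} \log E\|P_t \cdots P_1\|^{\tilde s}$

$\le \limsup_{t \to \infty} \dfrac{1}{t \tilde s} (\log c + t \log \lambda) = \dfrac{\log \lambda}{\tilde s} < 0.$

This completes the proof of the lemma. □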


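LEMMA 4.9. Suppose that

(4.21)  $P_t = \begin{pmatrix} A_t & 0_{r \times (k-r)} \\ B_t & C_t \end{pmatrix}, \qquad t \in \mathbb{Z},$

forms an i.i.d. sequence of $k \times k$ matrices with $E\|P_1\|^s < \infty$, $s > 0$, where
$A_t \in \mathbb{R}^{r \times r}$, $B_t \in \mathbb{R}^{(k-r) \times r}$ and $C_t \in \mathbb{R}^{(k-r) \times (k-r)}$. Then its associated top Lyapunov
coefficient $\rho_P < 0$ if and only if the sequences $(A_t)$ and $(C_t)$ have top Lyapunov
coefficients $\rho_A < 0$ and $\rho_C < 0$.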


PROOF. For the proof of sufficiency of $\rho_A < 0$ and $\rho_C < 0$ for $\rho_P < 0$, it is by
Lemma 4.8 enough to derive a moment inequality of the form (4.20) for $(P_t)$. By
induction we obtain

$P_t \cdots P_1 = \begin{pmatrix} A_t \cdots A_1 & 0_{r \times (k-r)} \\ Q_t & C_t \cdots C_1 \end{pmatrix},$

where

$Q_t = B_t A_{t-1} \cdots A_1 + C_t B_{t-1} A_{t-2} \cdots A_1 + C_t C_{t-1} B_{t-2} A_{t-3} \cdots A_1 + \cdots + C_t \cdots C_3 B_2 A_1 + C_t \cdots C_2 B_1.$

Observe that

(4.22)  $\max(\|A_t \cdots A_1\|, \|C_t \cdots C_1\|) \le \|P_t \cdots P_1\| \le \|A_t \cdots A_1\| + \|C_t \cdots C_1\| + \|Q_t\|.$

It is sufficient to show (4.20) for each block in the matrix $P_t \cdots P_1$. Because of
$\rho_A < 0$, $\rho_C < 0$ and $E\|A_1\|^s, E\|C_1\|^s \le E\|P_1\|^s < \infty$, Lemma 4.8 already
implies moment bounds of the form (4.20) for $(A_t)$ and $(C_t)$. Thus, we are left to
bound $\|Q_t\|$. Without loss of generality, we may assume that the constants $\lambda < 1$
and $\tilde s, c > 0$ in (4.20) are equal for $(A_t)$ and $(C_t)$ and that $\tilde s \le s \le 1$. From an
application of the Minkowski inequality and exploiting the independence of the
factors in each summand of $Q_t$, we obtain the desired relation

$E\|Q_t\|^{\tilde s} \le c^2 t \, E\|B_1\|^{\tilde s} \lambda^{t-1} \le \tilde c \, \tilde\lambda^t,$

for some $\tilde\lambda \in (\lambda, 1)$, $\tilde c > 0$. For the proof of necessity, assume $\rho_P < 0$. Then the
left-hand side estimates in (4.22) and Lemma 4.8 imply that $\rho_A < 0$ and $\rho_C < 0$. □


We now exploit Theorem 4.5 in order to establish strong mixing with geometric
rate of the sequence $(\mathbf{Y}_t) = (G_t Y_t)$, where $G_t = h_t'(\theta_0)/\sigma_t^2$ and $Y_t = (Z_t^2 - 1)/2$.


PROPOSITION 4.10. Let $(X_t)$ be a stationary GARCH$(p, q)$ process with
true parameter vector $\theta_0$. Moreover, assume that $Z_0$ has a Lebesgue density $f$,
where the closure of the interior of the support $\{f > 0\}$ contains the origin. Then
$(\mathbf{Y}_t)$ is absolutely regular with geometric rate.


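PROOF. For the proof of this result, we first embed $(\mathbf{Y}_t)$ in a polynomial linear
SRE. Without loss of generality, assume $p, q \ge 3$. Write

$\widetilde{\mathbf{Y}}_t = \Bigl( \sigma_{t+1}^2, \ldots, \sigma_{t-q+2}^2, X_t^2, \ldots, X_{t-p+2}^2,$
$\dfrac{\partial h_{t+1}(\theta_0)}{\partial \alpha_0}, \ldots, \dfrac{\partial h_{t-q+2}(\theta_0)}{\partial \alpha_0}, \ldots, \dfrac{\partial h_{t+1}(\theta_0)}{\partial \alpha_p}, \ldots, \dfrac{\partial h_{t-q+2}(\theta_0)}{\partial \alpha_p},$
$\dfrac{\partial h_{t+1}(\theta_0)}{\partial \beta_1}, \ldots, \dfrac{\partial h_{t-q+2}(\theta_0)}{\partial \beta_1}, \ldots, \dfrac{\partial h_{t+1}(\theta_0)}{\partial \beta_q}, \ldots, \dfrac{\partial h_{t-q+2}(\theta_0)}{\partial \beta_q} \Bigr)'.$

Since $Z_t^2 = X_t^2/\sigma_t^2$, we have

$\sigma(\mathbf{Y}_t, t > k) \subset \sigma(\widetilde{\mathbf{Y}}_t, t > k)$ and $\sigma(\mathbf{Y}_t, t \le 0) \subset \sigma(\widetilde{\mathbf{Y}}_t, t \le 0).$

Consequently, it is enough to demonstrate absolute regularity with geometric rate
of the sequence $(\widetilde{\mathbf{Y}}_t)$. We introduce various matrices. Write $0_{d_1 \times d_2}$ for the $d_1 \times d_2$
matrix with all entries equal to zero and let $I_d$ denote the identity matrix of
dimension $d$. Then set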


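$M_1(Z_t) = \begin{pmatrix} \tau_t & \beta_q^{\circ} & \alpha^{\circ} & \alpha_p^{\circ} \\ I_{q-1} & 0_{(q-1) \times 1} & 0_{(q-1) \times (p-2)} & 0_{(q-1) \times 1} \\ \xi_t & 0_{1 \times 1} & 0_{1 \times (p-2)} & 0_{1 \times 1} \\ 0_{(p-2) \times (q-1)} & 0_{(p-2) \times 1} & I_{p-2} & 0_{(p-2) \times 1} \end{pmatrix},$

where

$\tau_t = (\beta_1^{\circ} + \alpha_1^{\circ} Z_t^2, \beta_2^{\circ}, \ldots, \beta_{q-1}^{\circ}) \in \mathbb{R}^{q-1},$
$\xi_t = (Z_t^2, 0, \ldots, 0) \in \mathbb{R}^{q-1},$
$\alpha^{\circ} = (\alpha_2^{\circ}, \ldots, \alpha_{p-1}^{\circ}) \in \mathbb{R}^{p-2}.$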
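Moreover, define

$M_2(Z_t) = \begin{pmatrix} 0_{q \times (p+q-1)} \\ U_1 \\ \vdots \\ U_p \end{pmatrix}$ and $M_4 = \begin{pmatrix} V_1 \\ \vdots \\ V_q \end{pmatrix},$

where $U_i \in \mathbb{R}^{q \times (p+q-1)}$ and $V_j \in \mathbb{R}^{q \times (p+q-1)}$ are given by

$[U_1]_{k,\ell} = \delta_{k\ell,11} Z_t^2,$
$[U_i]_{k,\ell} = \delta_{k\ell,1(q+i-1)}, \qquad i \ge 2,$
$[V_j]_{k,\ell} = \delta_{k\ell,1j}.$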
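Here $\delta$ denotes the Kronecker symbol. Also introduce the $q \times q$ matrix

$C = \begin{pmatrix} \beta_1^{\circ} \; \cdots \; \beta_{q-1}^{\circ} & \beta_q^{\circ} \\ I_{q-1} & 0_{(q-1) \times 1} \end{pmatrix}$

and let

$M_3 = \mathrm{diag}(C, p + 1), \qquad M_5 = \mathrm{diag}(C, q)$

be the block diagonal matrices consisting of $p + 1$ (or $q$) copies of the block $C$.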
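Finally, we define

$P(Z_t) = \begin{pmatrix} M_1(Z_t) & 0_{(p+q-1) \times q(p+1)} & 0_{(p+q-1) \times q^2} \\ M_2(Z_t) & M_3 & 0_{q(p+1) \times q^2} \\ M_4 & 0_{q^2 \times q(p+1)} & M_5 \end{pmatrix}$

and $Q \in \mathbb{R}^{p+q-1+q(p+1)+q^2}$ by $[Q]_k = \alpha_0^{\circ} \delta_{k,1} + \delta_{k,p+q}$. Differentiating both sides
of (4.8) at the true parameter $\theta = \theta_0$, we recognize that

$h_{t+1}'(\theta_0) = (1, X_t^2, \ldots, X_{t-p+1}^2, \sigma_t^2, \ldots, \sigma_{t-q+1}^2)' + \beta_1^{\circ} h_t'(\theta_0) + \cdots + \beta_q^{\circ} h_{t-q+1}'(\theta_0).$

From this recursive relationship together with
$\sigma_{t+1}^2 = \alpha_0^{\circ} + \alpha_1^{\circ} X_t^2 + \cdots + \alpha_p^{\circ} X_{t-p+1}^2 + \beta_1^{\circ} \sigma_t^2 + \cdots + \beta_q^{\circ} \sigma_{t-q+1}^2$
we derive a polynomial linear SRE for $\widetilde{\mathbf{Y}}_t$:

(4.23)  $\widetilde{\mathbf{Y}}_t = P(Z_t) \widetilde{\mathbf{Y}}_{t-1} + Q.$

The proof of Proposition 4.10 follows from the following lemma. □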


LEMMA 4.11. Under the assumptions of Proposition 4.10, the polynomial
linear SRE (4.23) has a strictly stationary solution $(\widetilde{\mathbf{Y}}_t)$ which is absolutely regular
with geometric rate.


PROOF. The aim is to show that (4.23) obeys the conditions of Theorem 4.5.
Since $EZ_0^2 = 1$, it is immediate that $E\|P(Z_1)\| < \infty$ since this statement is true for
the Frobenius norm and all matrix norms are equivalent. Treat the blocks $M_1(Z_t)$,
$M_3$ and $M_5$ separately. Observe that the matrix $M_1(Z_t)$ appears in the linear SRE
for the vector $S_t = (\sigma_{t+1}^2, \ldots, \sigma_{t-q+2}^2, X_t^2, \ldots, X_{t-p+2}^2)'$, namely,

$S_t = M_1(Z_t) S_{t-1} + (\alpha_0^{\circ}, 0, \ldots, 0)'.$

Theorem 1.3 of [6] says that (1.1) admits a unique stationary solution if and only
if $(M_1(Z_t))$ has strictly negative top Lyapunov coefficient; consequently, $\rho_{M_1} < 0$.
Moreover, arguing by recursion on $p$ and expanding the determinant with respect
to the last column, it is easily verified that $M_1(0)$ has characteristic polynomial

$\det(\lambda I_{p+q-1} - M_1(0)) = \lambda^{p+q-1} \Bigl( 1 - \sum_{i=1}^{q} \beta_i^{\circ} \lambda^{-i} \Bigr).$

Since (4.1) holds for a stationary GARCH$(p, q)$ process, by the triangle inequality

$\Bigl| 1 - \sum_{i=1}^{q} \beta_i^{\circ} \lambda^{-i} \Bigr| \ge 1 - \sum_{i=1}^{q} \beta_i^{\circ} |\lambda|^{-i} \ge 1 - \sum_{i=1}^{q} \beta_i^{\circ} > 0$

if $|\lambda| \ge 1$ and, hence, $M_1(0)$ has spectral radius $< 1$. Observe that the building
block $C$ has characteristic polynomial

$\det(\lambda I_q - C) = \lambda^q \Bigl( 1 - \sum_{i=1}^{q} \beta_i^{\circ} \lambda^{-i} \Bigr),$

showing that its spectral radius is strictly smaller than 1 (use the same argument
as before). Thus, the deterministic matrices $M_3$ and $M_5$ have spectral radius $< 1$,
which also implies that their associated top Lyapunov coefficients are strictly
negative. Combining these results, we deduce that $P(0)$ has spectral radius $< 1$ and
conclude by twice applying Lemma 4.9 that $(P(Z_t))$ has strictly negative top
Lyapunov coefficient. Hence, by Theorem 4.5 the stationary sequence $(\widetilde{\mathbf{Y}}_t)$ is
absolutely regular with geometric rate. □
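
The spectral radius argument above can be checked numerically for given coefficients. The Python sketch below is an illustration under stated assumptions rather than part of the proof: it takes nonnegative coefficients $\beta_1, \ldots, \beta_q$, uses the fact that the spectral radius of the nonnegative companion block is the unique positive root of $\lambda \mapsto 1 - \sum_i \beta_i \lambda^{-i}$ (this function is strictly increasing on $(0, \infty)$, and by Perron–Frobenius the spectral radius is itself an eigenvalue), and locates that root by bisection. The function name and the example coefficients are hypothetical.

```python
def companion_spectral_radius(betas, tol=1e-12):
    """Spectral radius of the companion block with first row `betas` (all >= 0).

    Bisection on f(lam) = 1 - sum_i betas[i-1] * lam**(-i), which is strictly
    increasing for lam > 0 and has a unique positive root when some beta > 0.
    """
    if any(b < 0 for b in betas):
        raise ValueError("coefficients are assumed nonnegative")
    if not any(b > 0 for b in betas):
        return 0.0  # nilpotent shift matrix

    def f(lam):
        return 1.0 - sum(b * lam ** -(i + 1) for i, b in enumerate(betas))

    lo, hi = 0.0, max(1.0, sum(betas))  # f(hi) >= 0 by construction
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Coefficients summing to less than 1 give a spectral radius below 1.
    print(companion_spectral_radius([0.2, 0.3]))
```

The root is below 1 exactly when $f(1) = 1 - \sum_i \beta_i > 0$, which is the inequality used in the proof.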


REMARK 4.12. Since $(X_t^2, \sigma_t^2)$ is a subvector of $\widetilde{\mathbf{Y}}_t$, stationary GARCH$(p, q)$
processes are absolutely regular with geometric rate; this result has previously been
established by Boussama [8].


Acknowledgments. Daniel Straumann would like to thank RiskLab Zürich
for the opportunity to conduct this research. In particular, he is grateful to Paul 
Embrechts and Alexander McNeil for stimulating discussions. Both authors would 
like to thank the two referees for various remarks which led to a better presentation 
of the paper. 


REFERENCES 


[1] BABILLOT, M., BOUGEROL, P. and ELIE, L. (1997). The random difference equation X, = 
Ag X41 + Bn in the critical case. Ann. Probab. 25 478—493 MR1428518 
[2] BASRAK, B., DAVIS, R. A. and MIKOSCH, T (2002). Regular variation of GARCH processes. 
Stochastic Process. Appl. 99 95-115. MR1894253 
[3] BERKES, I., HORVÁTH, L. and KOKOSZKA, P. (2003). GARCH processes: Structure and
estimation. Bernoulli 9 201–227. MR1997027
[4] BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley, New York.
MR0233396
[5] BINGHAM, N. H., GOLDIE, C M. and TEUGELS, J L. (1987). Regular Variation. Cambridge 
Univ. Press. MR0898871 
[6] BOUGEROL, P and PICARD, N (1992). Stationarity of GARCH processes and of some non- 
negative time series. J. Econometrics 52 115—127. MR1165646 
[7] BOUGEROL, P and PICARD, N. (1992). Strict stationarity of generalized autoregressive 
processes. Ann. Probab. 20 1714-1730. MR1188039 
[8] BOUSSAMA, F. (1998). Ergodicité, mélange et estimation dans les modèles GARCH. Ph.D.
dissertation, Univ. Paris 7.
[9] BRADLEY, R. C (1986). Basic properties of strong mixing conditions. In Dependence in Prob- 
ability and Statistics (E. Eberlein and M S. Taqqu, eds.) 165—192. Birkhauser, Boston. 
MR0899990 
[10] BRANDT, A. (1986) The stochastic equation Yp] = An Yn + Bj with stationary coefficients. 
Adv. in Appl. Probab. 18 211-220. MR0827336 
[11] BREIMAN, L. (1965) On some limit theorems similar to the arc-sin law. Theory Probab. Appl. 
10 323-331. MR0184274 
[12] DAVIS, R. A. and HSING, T. (1995). Point process and partial sum convergence for weakly de- 
pendent random variables with infinite variance. Ann. Probab, 23 879-917. MR1334176 
[13] Davis, R. A. and MIKOSCH, T. (1998). The sample autocorrelations of heavy-tailed processes 
with applications to ARCH. Ann. Statist. 26 2049-2080. MR1673289 
[14] DOUKHAN, P. (1994). Mixing. Properties and Examples Lecture Notes in Statist 85 Springer, 
New York. MR1312160 




[15] EMBRECHTS, P, KLUPPELBERG, C. and MIKOSCH, T. (1997). Modelling Extremal Events 
for Insurance and Finance Springer, Berlin. MR1458613 

[16] HALL, P. and YAO, Q. (2003). Inference in ARCH and GARCH models with heavy-tailed
errors. Econometrica 71 285–317. MR1956860

[17] IBRAGIMOV, I. A. and LINNIK, YU. V. (1971) Independent and Stationary Sequences of 
Random Variables. Wolters-Noordhoff, Groningen. MR0322926 

[18] LEADBETTER, M. R., LINDGREN, G. and ROOTZÉN, H. (1983). Extremes and Related Prop- 
erties of Random Sequences and Processes. Springer, Berlin. MR0691492 

[19] MIKOSCH, T. (2003). Modeling dependence and tails of financial time series. In Extreme
Values in Finance, Telecommunications, and the Environment (B. Finkenstädt and
H. Rootzén, eds.) 185–286. Chapman and Hall, Boca Raton, FL.

[20] MIKOSCH, T. and STRAUMANN, D (2006). Stable limits of martingale transforms with appli- 
cation to the estimation of GARCH parameters. Available at www.math.ku.dk/-mikosch/ 
Prepnnt/Stab 

[21] MOKKADEM, A. (1990). Propriétés de mélange des processus autorégressifs polynomiaux. 
Ann. Inst. H. Poincaré Probab Statist. 26 219—260. MR1063750 

[22] RESNICK, S. I. (1986). Point processes, regular variation and weak convergence. Adv. in Appl.
Probab. 18 66–138. MR0827332

[23] RESNICK, S. I. (1987). Extreme Values, Regular Variation, and Point Processes. Springer, New
York. MR0900810

[24] ROSENBLATT, M (1956). A central limit theorem and a strong mixing condition. Proc. Natl 
Acad Sci. U.S.A. 42 43-47 MR0074711 

[25] RVAČEVA, E. L. (1962). On domains of attraction of multi-dimensional distributions. In Select.
Transl. Math. Statist. Probab. 2 183–205. Amer. Math. Soc., Providence, RI. MR0150795

[26] SAMORODNITSKY, G. (2004). Extreme value theory, ergodic theory and the boundary be- 
tween short memory and long memory for stationary stable processes. Ann. Probab, 32 
1438-1468 MR2060304 

[27] SAMORODNITSKY, G and TAQQU, M. S. (1994). Stable Non-Gaussian Random Processes 
Stochastic Models with Infinite Variance. Chapman and Hall, London. MR1280932 

[28] STRAUMANN, D. and MIKOSCH, T (2006). Quasi-maximum-likelihood estimation in con- 
ditionally heteroscedastic time series A stochastic recurrence equations approach. Ann. 
Statist. 34. To appear. 


LABORATORY OF ACTUARIAL MATHEMATICS
DEPARTMENT OF APPLIED MATHEMATICS AND STATISTICS
UNIVERSITY OF COPENHAGEN
UNIVERSITETSPARKEN 5
DK-2100 COPENHAGEN Ø
DENMARK
AND
MAPHYSTO
THE DANISH RESEARCH NETWORK
FOR MATHEMATICAL PHYSICS
AND STOCHASTICS
E-MAIL: mikosch@math.ku.dk

RISKLAB
DEPARTMENT OF MATHEMATICS
ETH ZÜRICH
ETH ZENTRUM
CH-8092 ZÜRICH
SWITZERLAND
E-MAIL: straumann@math.ethz.ch


The Annals of Statistics

2006, Vol. 34, No. 1, 523–545

DOI: 10.1214/009053605000000822

© Institute of Mathematical Statistics, 2006


SEQUENTIAL IMPORTANCE SAMPLING FOR
MULTIWAY TABLES¹


BY YUGUO CHEN, IAN H. DINWOODIE AND SETH SULLIVANT


University of Illinois at Urbana-Champaign, Duke University
and Harvard University


We describe an algorithm for the sequential sampling of entries in multiway
contingency tables with given constraints. The algorithm can be used for
computations in exact conditional inference. To justify the algorithm, a theory
relates sampling values at each step to properties of the associated toric ideal
using computational commutative algebra. In particular, the property of interval
cell counts at each step is related to exponents on lead indeterminates of
a lexicographic Gröbner basis. Also, the approximation of integer programming
by linear programming for sampling is related to initial terms of a toric
ideal. We apply the algorithm to examples of contingency tables which appear
in the social and medical sciences. The numerical results demonstrate
that the theory is applicable and that the algorithm performs well.


1. Introduction. Sampling from multiway contingency tables with given 
constraints can be used to compute exact Monte Carlo p-values of goodness-of-fit 
and parameter significance for conditional inference. This is desirable when the
tables of interest are numerous but have entries that raise doubts about the valid- 
ity of asymptotic methods. A classical application is testing for Hardy-Weinberg 
equilibrium with multiple alleles, where some alleles may be quite rare and result 
in sparse tables [20]. Other applications are described in [2, 6, 13]. A more general 
problem is sampling from nonnegative integer lattice points. This includes contin- 
gency tables, and further applications such as Monte Carlo EM algorithms with 
incomplete data [31] and Bayesian computation of posterior distributions [30]. 

Markov chain Monte Carlo (MCMC) has been a popular technique for gen- 
erating random samples from tables with given constraints. It is usually easy to 
program, does not require a lot of memory, and has wide applicability. Diaconis 
and Sturmfels [14] gave algebraic characterizations of the moves necessary to run 
such a Markov chain. However, for some loglinear models the constraints from suf- 
ficient statistics on multiway tables make it difficult to design irreducible Markov 
chains. Diaconis and Sturmfels [14] gave a method to produce Markov moves 


Received July 2004; revised February 2005.
¹Supported by NSF Grants DMS-02-00888, DMS-02-03762 and DMS-05-03981 and NSF Grant
DMS-01-12069 to SAMSI.
AMS 2000 subject classifications. Primary 62H17, 62F03; secondary 13P10.
Key words and phrases. Conditional inference, contingency table, exact test, Monte Carlo,
sequential importance sampling, toric ideal.




524 Y. CHEN, I. H. DINWOODIE AND S. SULLIVANT 


that connect all tables with given constraints, but in some practical cases, such as 
large logistic regression examples, the moves cannot be computed. It is sometimes 
possible to do computations with a smaller collection of moves by letting some 
entries in the space of tables go negative. This idea is used in [4, 7]. The cost is a 
longer running time for the Markov chain. In general, the running times of these 
Markov chains are very difficult to judge. Therefore, Markov chains have three 
disadvantages: (1) they can be hard to design, (2) they can take a long time to run 
to stationarity, and (3) the time to run to stationarity may not be clear. 

Sequential importance sampling (SIS) avoids these disadvantages of Markov 
chains because it is relatively easy to implement and there is no issue of converg- 
ing to a stationary distribution. Chen et al. [6] introduced an SIS procedure for 
simulating two-way zero—one and contingency tables with fixed marginal sums, 
which compares favorably with other existing Monte Carlo-based algorithms. Sim- 
ilar techniques have also been applied to a logistic regression problem in [7]. This 
paper shows that SIS can be implemented efficiently for many multiway contin- 
gency table problems that have been studied mostly with Markov chains. 

The idea behind SIS is to sample cell entries in the contingency table one after 
the other so that the final joint distribution (i.e., the proposal distribution) is close
to the target distribution. SIS does not have the same disadvantages as a Markov
chain, because the method terminates at the last cell and generates i.i.d. samples
from the proposal distribution. However, SIS raises a new set of implementation 
issues. The main problems are approximating the support of the marginal distri- 
bution of each cell quickly, and then approximating the marginal distribution on 
the support set with a proposal distribution. We show how properties of the sam- 
pling set at each step can be deduced from algebraic conditions on a collection of 
Markov moves. The results of this paper extend the applicability of SIS from two- 
way tables [6] to a wider range of multiway tables and allow further comparison 
with Markov chain methods. 

The target distribution on the collection of tables may be hypergeometric, which 
arises in conditional inference with multinomial sampling, or it may be another
related distribution such as the one for Hardy-Weinberg proportions. SIS can yield
an approximate count of constrained tables very quickly when the target distribu- 
tion is uniform. This application has been carried out in [6], where SIS was shown 
to be more efficient than Markov chains for counting and testing two-way tables. 
Combinatorists are interested in counting tables with given constraints [11].
Counting tables is also related to conditional volume tests [13]. In our multiway
examples, we found approximate counts of tables without difficulty. The exact counting
software LattE [11] confirmed the counts on the two smaller examples. The uni- 
form target distribution is also useful for Bayesian applications where a uniform 
prior on probabilities leads to equally likely tables, and for the conditional volume 
test [13]. 

The paper is organized as follows. In Section 2 we introduce essential ideas 
of SIS. The algebraic conditions for efficient sampling are formulated in Sections 


SIS FOR MULTIWAY TABLES 525 


3 and 4. Many of the algebraic ideas of Markov chains on lattice points are used. 
Section 3 treats the basic case where properties of polynomials generating the toric 
ideal are related to SIS. Section 4 is more technical and develops stronger methods 
for subsets of the Markov basis. These results can apply when the observed mar- 
gins imply conditions of positivity on the tables constrained by the margin values. 

Section 5 is about the relationship between linear programming (LP) and integer 
programming (IP). When the support of the marginal cell distribution is an
interval of integers $[l, u]$, a situation established under conditions in Sections 3 and 4
and which occurs often in practice, one needs the values of the upper and lower 
bounds. Knowing then that LP and IP give nearly the same answer is important, 
because using an IP algorithm at each step in the procedure would be much slower 
than using LP. A precise algebraic relationship between LP and IP is developed 
in [22], which gives an algorithm for finding the maximum difference between the 
two over all conceivable data sets. The results here may be easier to apply in some 
examples. In practice it is not essential that LP and IP be identical. Section 6 dis- 
cusses sampling distributions for different target distributions. In Section 7 we give 
a range of examples to show how well SIS can work in real problems. Section 8 
provides concluding remarks. 


2. Elements of SIS. Let $\Omega$ denote the set of all contingency tables with given
constraints. Assume $\Omega$ is nonempty. The p-value for conditional inference on
contingency tables can often be written as

(1)  $\mu = E_p f(n) = \sum_{n \in \Omega} f(n) p(n),$

where $p(n)$ is the underlying distribution on $\Omega$, which is usually uniform or
hypergeometric and only known up to a normalizing constant, and $f(n)$ is a function
of the test statistic. For example, if we let

(2)  $f(n) = 1_{\{p(n) \le p(n_0)\}},$

where $n_0$ is the observed table, formula (1) gives the p-value of the exact test [20].
In many cases sampling from $p(n)$ directly is difficult. The importance sampling
approach is to simulate a table $n \in \Omega$ from a different distribution $q(\cdot)$, where
$q(n) > 0$ for all $n \in \Omega$, and estimate $\mu$ by

(3)  $\hat\mu = \dfrac{\sum_{i=1}^{N} f(n_i) p(n_i)/q(n_i)}{\sum_{i=1}^{N} p(n_i)/q(n_i)},$

where $n_1, \ldots, n_N$ are i.i.d. samples from $q(n)$. We can also estimate the total
number of tables in $\Omega$ by

(4)  $|\hat\Omega| = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{1}{q(n_i)},$

because $|\Omega| = \sum_{n \in \Omega} \frac{1}{q(n)} \, q(n)$. The underlying distribution on $\Omega$ corresponding to
this case is uniform.

In order to evaluate the efficiency of an importance sampling algorithm, we can
look at the number of i.i.d. samples from the target distribution that are needed to
give the same standard error for $\hat\mu$ as $N$ importance samples. A rough approximation
for this number is the effective sample size [24]

(5)  $\mathrm{ESS} = \dfrac{N}{1 + cv^2},$

where the coefficient of variation (cv) is defined as

(6)  $cv^2 = \dfrac{\operatorname{var}_q\{p(n)/q(n)\}}{E_q^2\{p(n)/q(n)\}}.$

Accurate estimation generally requires a low $cv^2$, that is, $q(n)$ must be sufficiently
close to $p(n)$. We will use $cv^2$ as a measure of efficiency for an importance
sampling scheme. In practice, the theoretical value of $cv^2$ is unknown, so its sample
counterpart is used to estimate $cv^2$. The standard error of $\hat\mu$ or $|\hat\Omega|$ can be simply
estimated by further repeated sampling [6].
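
As a minimal sketch of how the estimators (3) and (4) and the sample version of $cv^2$ might be computed from importance samples, consider the following Python functions. The function names are hypothetical, the weights stand for $p(n_i)/q(n_i)$ known up to a constant, and the sample variance uses the divisor $N$, which is one of several reasonable conventions.

```python
def self_normalized_estimate(f_values, weights):
    """Estimator (3): weighted average of f(n_i) with weights p(n_i)/q(n_i)."""
    return sum(f * w for f, w in zip(f_values, weights)) / sum(weights)


def count_estimate(q_values):
    """Estimator (4): average of 1/q(n_i), for the uniform target on the tables."""
    return sum(1.0 / q for q in q_values) / len(q_values)


def cv_squared(weights):
    """Sample counterpart of cv^2: variance of the weights over their squared mean."""
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    return var / mean ** 2


def effective_sample_size(weights):
    """ESS = N / (1 + cv^2), with cv^2 replaced by its sample counterpart."""
    return len(weights) / (1.0 + cv_squared(weights))


if __name__ == "__main__":
    weights = [1.0, 3.0]
    print(cv_squared(weights), effective_sample_size(weights))
```

Because $cv^2$ is invariant to rescaling the weights, the unknown normalizing constant of $p$ does not affect the effective sample size.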

SIS as it applies to multiway tables fills in the entries of a table cell by cell,
in a way that guarantees that every table in $\Omega$ can be produced. More precisely,
we stack all entries of the table into a long vector $n$, and start by sampling the
first cell count $n_1$ of the vector $n$ with a proposal distribution $q(n_1)$. Conditional
on the realization of the first cell, we sample the second cell count $n_2$ with a
proposal distribution $q(n_2 \mid n_1)$, and then move forward sequentially until all the cells
are sampled. Denoting the cell counts of $n$ by $n_1, \ldots, n_d$, we can write the joint
proposal distribution $q$ as

$q((n_1, \ldots, n_d)) = q(n_1) \, q(n_2 \mid n_1) \, q(n_3 \mid n_2, n_1) \cdots q(n_d \mid n_{d-1}, \ldots, n_1).$

Ideally, one would like to sample a cell value from the marginal distribution of
a cell entry, conditional on the entries that have already been sampled. However,
these marginal distributions are quite difficult to compute explicitly except in very
small examples. SIS then raises some problems if it is to be used effectively:
(1) When and how can the support of the marginal distribution $n_j \mid (n_{j-1}, \ldots, n_1)$
be quickly determined or approximated? (2) How can the support of the marginal
distribution be sampled with a proposal distribution $q$ that is close to the true
underlying distribution $p$? We address these questions in the following sections of
the paper.
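
The cell-by-cell scheme can be sketched for the simplest case, a two-way table with fixed row and column sums. The Python code below is an illustration under stated assumptions, not the proposal distribution of [6]: each cell is drawn uniformly from its interval of feasible values, the cells are visited in row-major order, and the function names are hypothetical. For two-way tables the feasible values of a cell, given the cells already filled, form the interval computed in the code.

```python
import random


def sis_two_way(row_sums, col_sums, rng):
    """Sample a two-way table cell by cell (row-major) with fixed margins.

    Returns (table, q), where q is the probability of the table under this
    uniform-on-interval proposal (an illustrative choice).
    """
    if sum(row_sums) != sum(col_sums):
        raise ValueError("row and column sums must have the same total")
    col_rem = list(col_sums)  # column totals still to be filled
    table, q = [], 1.0
    for r_rem in row_sums:    # r_rem: amount of this row still to be placed
        row = []
        for j in range(len(col_rem)):
            tail = sum(col_rem[j + 1:])   # capacity of the later columns
            lo = max(0, r_rem - tail)     # the rest of the row must fit later
            hi = min(r_rem, col_rem[j])   # cannot exceed either margin
            x = rng.randint(lo, hi)       # inclusive on both ends
            q /= hi - lo + 1
            row.append(x)
            r_rem -= x
            col_rem[j] -= x
        table.append(row)
    return table, q


def estimate_count(row_sums, col_sums, n_samples, rng):
    """Estimator (4): average of 1/q over independent SIS draws."""
    total = sum(1.0 / sis_two_way(row_sums, col_sums, rng)[1]
                for _ in range(n_samples))
    return total / n_samples


if __name__ == "__main__":
    rng = random.Random(0)
    print(sis_two_way([3, 2, 4], [4, 4, 1], rng))
    print(estimate_count([3, 2, 4], [4, 4, 1], 2000, rng))
```

A uniform draw on each interval is easy to code but can give a large $cv^2$; proposals that track the target conditional distribution more closely are preferable in practice.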


3. Sequential intervals and algebra. When they apply SIS to the problem
of sampling two-way contingency tables with fixed marginal sums, Chen et al. [6]
notice that the support of the marginal distribution $n_j \mid (n_{j-1}, \ldots, n_1)$ is an interval
of integers [here $n = (n_1, \ldots, n_d)$ is the table in a vector format]. Therefore, they
can sample a value from the interval at each step and always produce a table in $\Omega$,
that is, every table satisfies the constraints. This saves a lot of computing time
compared to rejection sampling. Another advantage of having this interval property is
that one can find a good proposal distribution $q(n_j \mid n_{j-1}, \ldots, n_1)$ more easily than
in the situation where there are gaps in the support set.

SIS tends to perform better when the sequential interval property holds, but for 
general constraints on multiway tables, it is not always true that one can fill in 
entries in sequence and expect the range of feasible values to be an interval of
integers. Examples where the sequential interval property does not hold are very 
sparse logistic regression [7], many 3-way tables with certain margin constraints 
(see [12] for the full range of difficulties with 3-way tables) and some triangular 
tables of genotype data when cells are sampled in certain orders. Typically, there 
may be a problem if the moves of a Markov basis involve changes in some entry 
that are of size $\pm 2$ or larger. A precise condition is more complicated and weaker
than "no moves of size greater than 1," and may depend on the margin values and 
the order of the sequential sampling. In this section we give the basic theorems that 
are not related to the actual values of the margin constraints. In the next section we 
strengthen the results. 

Now we introduce notation for lattice points and the algebra of polynomials 
that will be used in our study of SIS. Let $A$ be an $r \times d$ matrix of nonnegative
integers, denoted $\mathbb{Z}_+$. In applications $d$ is the number of cells in the table, and $r$ is
the number of parameters (not necessarily free) in an exponential family model. A 
is often referred to as the constraint matrix and r is the total number of constraints. 
We assume that a sum of some nonempty subset of the rows of A is a strictly 
positive vector. In applications with multinomial sampling, this will be immediate 
because the sample size is fixed, so the constant vector of ones is a row or is in the 
row space of $A$. For $\mathbf{t} \in \mathbb{Z}_+^r$, let


$$A^{-1}[\mathbf{t}] := \{\mathbf{n} \in \mathbb{Z}_+^d : A\mathbf{n} = \mathbf{t}\}.$$


This is a collection of tables with linear constraints, that is, the set of nonnegative 
integer points inside a polytope. The linear constraint value t will sometimes infor- 
mally be called a margin constraint. The value of t will typically be the sufficient 
statistics for a loglinear model. Our primary goal is to sample from $A^{-1}[\mathbf{t}]$.
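For very small problems the set $A^{-1}[\mathbf{t}]$ can be listed by brute force, which is a convenient way to check the definitions. The sketch below is illustrative only: the helper name `fiber` and the $2 \times 2$ example are assumptions, and it relies on every column of $A$ having a positive entry so that each cell is bounded.

```python
from itertools import product

def fiber(A, t):
    """Return all nonnegative integer vectors n with A n = t (brute force)."""
    d = len(A[0])
    # Assumption: each column j has some A[i][j] > 0, which bounds cell j
    # by t[i] // A[i][j].
    upper = []
    for j in range(d):
        caps = [t[i] // A[i][j] for i in range(len(A)) if A[i][j] > 0]
        upper.append(min(caps))
    tables = []
    for n in product(*(range(u + 1) for u in upper)):
        if all(sum(a * x for a, x in zip(row, n)) == ti for row, ti in zip(A, t)):
            tables.append(n)
    return tables

# 2 x 2 table with cells (n11, n12, n21, n22), row sums (2, 1), column sums (2, 1).
A = [
    [1, 1, 0, 0],  # row 1 sum
    [0, 0, 1, 1],  # row 2 sum
    [1, 0, 1, 0],  # column 1 sum
    [0, 1, 0, 1],  # column 2 sum
]
t = [2, 1, 2, 1]
print(fiber(A, t))  # [(1, 1, 1, 0), (2, 0, 0, 1)]
```

Enumeration of this kind is exponential in the number of cells, so it is only a checking device for toy tables.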

Let us first recall the notion of a Markov move on $A^{-1}[\mathbf{t}]$. If $\mathbf{m} \in \ker_{\mathbb{Z}}(A)$
(the null space of $A$ in the integers), then $\mathbf{m}$ is a Markov move. With a collection
of such moves, one can define a symmetric Markov chain on $A^{-1}[\mathbf{t}]$ by starting at
an initial state $\mathbf{n} \in A^{-1}[\mathbf{t}]$, and then uniformly choosing one of the moves $\mathbf{m}$ and
a sign on the move, and then moving to the new state $\mathbf{n} + \mathbf{m}$ if this new vector
is nonnegative (i.e., every entry is nonnegative). A Markov basis $M_A$ for $A$ is a
subset of $\ker_{\mathbb{Z}}(A)$ such that, for each pair of vectors $\mathbf{u}, \mathbf{v} \in \mathbb{Z}_+^d$ with $A\mathbf{u} = A\mathbf{v}$,
there is a sequence of vectors $\mathbf{m}_i \in M_A$, $i = 1, \ldots, l$, such that


$$\mathbf{u} = \mathbf{v} + \sum_{i=1}^{l} \mathbf{m}_i,$$


528 Y. CHEN, I. H. DINWOODIE AND S. SULLIVANT 


$$0 \le \mathbf{v} + \sum_{i=1}^{j} \mathbf{m}_i, \qquad j = 1, \ldots, l.$$
That is, two nonnegative vectors with the same linear constraints can be connected
with a sequence of increments from $M_A$ while always maintaining the linear
constraints and the nonnegativity.
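One step of the symmetric chain described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `markov_step`, the seed and the $2 \times 2$ example with its single basic move are assumptions.

```python
import random

def markov_step(n, moves, rng):
    """Pick a move and a sign uniformly; stay put if the result has a negative entry."""
    m = rng.choice(moves)
    sign = rng.choice((1, -1))
    candidate = tuple(x + sign * y for x, y in zip(n, m))
    return candidate if min(candidate) >= 0 else n

# Basic move for a 2 x 2 table with cells ordered (n11, n12, n21, n22).
moves = [(1, -1, -1, 1)]
rng = random.Random(0)
state = (1, 1, 1, 0)          # row sums (2, 1), column sums (2, 1)
visited = {state}
for _ in range(200):
    state = markov_step(state, moves, rng)
    visited.add(state)
print(sorted(visited))
```

Because the move lies in the integer kernel of the constraint matrix, every visited state keeps the same margins; only the nonnegativity check can reject a proposal.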
Define the polynomial ring $\mathbb{Q}[x_1, \ldots, x_d]$ in indeterminates (polynomial
variables) $x_1, \ldots, x_d$, one for each cell. Define the toric ideal


$$I_A := \langle x^{\mathbf{n}} - x^{\mathbf{m}} : A\mathbf{n} = A\mathbf{m} \rangle,$$


where $x^{\mathbf{n}} := x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d}$ is the usual monomial notation for a nonnegative
integer vector of exponents $\mathbf{n} = (n_1, \ldots, n_d)$. The way to go between Markov
moves and polynomials is simple: order and number the cells in the table, create
an indeterminate (polynomial variable) for each cell in the table, and put
the positive Markov move cell values on one monomial, put the negative values
on another monomial, then form the difference. For example, the Markov move
$(1, -1, -1, 1)'$ can be denoted as $x_1 x_4 - x_2 x_3$. The choice of cell ordering can be
important, as in Example 7.5.
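The translation from a move to a binomial is mechanical, as the following sketch shows. The function names and the string format (`x1*x4`, `x1^2`) are illustrative assumptions, not notation from any particular computer algebra system.

```python
def monomial(exponents):
    """Format x^n as a product such as 'x1*x4^2'; the empty product is '1'."""
    factors = []
    for i, e in enumerate(exponents, start=1):
        if e == 1:
            factors.append(f"x{i}")
        elif e > 1:
            factors.append(f"x{i}^{e}")
    return "*".join(factors) if factors else "1"

def move_to_binomial(m):
    """Positive entries go on one monomial, negative entries on the other."""
    plus = [max(0, e) for e in m]
    minus = [max(0, -e) for e in m]
    return f"{monomial(plus)} - {monomial(minus)}"

print(move_to_binomial((1, -1, -1, 1)))  # x1*x4 - x2*x3
print(move_to_binomial((2, -1, 0)))      # x1^2 - x2
```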

There are two fundamental algebraic ideas related to Markov bases. For
$\mathbf{m} \in \mathbb{Z}^d$, define $\mathbf{m}^+ = \max(0, \mathbf{m})$, $\mathbf{m}^- = \max(0, -\mathbf{m})$, so $\mathbf{m} = \mathbf{m}^+ - \mathbf{m}^-$. The
first fundamental result, shown by Diaconis and Sturmfels ([14], Theorem 3.1), is
that a finite generating set of binomials $\{x^{\mathbf{m}_i^+} - x^{\mathbf{m}_i^-}, i = 1, \ldots, g\}$ for $I_A$ defines
Markov moves $\{\pm(\mathbf{m}_i^+ - \mathbf{m}_i^-)\}$, $i = 1, \ldots, g$, that are a Markov basis in that they
connect all of $A^{-1}[\mathbf{t}]$ when chosen randomly as vector increments, regardless of
the actual value of $\mathbf{t}$. In other words, a Markov basis always exists independently
of the actual values of the linear constraints. The second fundamental result ([29],
Theorem 8.14) is that a collection of moves will connect two tables $\mathbf{n}$ and $\mathbf{m}$ if
$x^{\mathbf{n}} - x^{\mathbf{m}} \in I$, where $I$ is the ideal generated by the collection of moves. This is
used to show connectivity for subcollections of the full Markov basis for particular
values of $\mathbf{t}$ in Section 4.


DEFINITION 3.1. Define the projection operator $\pi_i : \mathbb{Z}^d \to \mathbb{Z}$ by $\pi_i(z_1, \ldots, z_d) = z_i$.


LEMMA 3.1. Suppose a Markov basis $M_A$ satisfies $\pi_1(M_A) \subset \{-1, 0, +1\}$.
Then $\pi_1(A^{-1}[\mathbf{t}])$ is an interval of integers $[l_1, u_1]$.


PROOF. One can connect tables $\mathbf{m}, \mathbf{n} \in A^{-1}[\mathbf{t}]$ with values $m_1$ and $n_1$ in the
first coordinate by changing the first coordinate only $\pm 1$ at each step, so the gap
between possible values cannot be greater than 1. $\square$


If the columns of $A$ are $\mathbf{a}_1, \ldots, \mathbf{a}_d$, let $A_i = (\mathbf{a}_i, \mathbf{a}_{i+1}, \ldots, \mathbf{a}_d)$ be the matrix that
deletes the first $i - 1$ columns and keeps the last $d - i + 1$ columns of $A$.




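DEFINITION 3.2. The polytope $A^{-1}[\mathbf{t}]$ has the sequential interval property
if $\pi_1(A^{-1}[\mathbf{t}])$ is an interval of integers $[l_1, u_1]$, and for $i = 1, \ldots, d - 1$:
if $n_i \in \pi_1(A_i^{-1}[\mathbf{t} - n_1\mathbf{a}_1 - \cdots - n_{i-1}\mathbf{a}_{i-1}])$,
then $\pi_1(A_{i+1}^{-1}[\mathbf{t} - n_1\mathbf{a}_1 - \cdots - n_{i-1}\mathbf{a}_{i-1} - n_i\mathbf{a}_i])$
is also an interval of integers $[l_{i+1}, u_{i+1}]$.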
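On toy examples the definition can be checked directly by enumeration. The sketch below is only a brute-force illustration under assumed names (`fiber`, `has_sequential_interval_property`); it assumes every column of $A$ has an entry of at least 1 so that cells are bounded by the largest constraint value.

```python
from itertools import product

def fiber(A, t):
    """All nonnegative integer n with A n = t, by brute force (tiny examples only)."""
    d = len(A[0])
    return [n for n in product(range(max(t) + 1), repeat=d)
            if all(sum(a * x for a, x in zip(row, n)) == ti for row, ti in zip(A, t))]

def is_interval(values):
    """A finite set of integers is an interval when it has no gaps."""
    return max(values) - min(values) + 1 == len(values)

def has_sequential_interval_property(tables):
    """For every feasible prefix (n_1, ..., n_i), the feasible n_{i+1} must form an interval."""
    d = len(tables[0])
    for i in range(d):
        groups = {}
        for n in tables:
            groups.setdefault(n[:i], set()).add(n[i])
        if not all(is_interval(vals) for vals in groups.values()):
            return False
    return True

# One constraint n1 + n2 + n3 = 2: every move changes a cell by at most 1.
print(has_sequential_interval_property(fiber([[1, 1, 1]], [2])))  # True
# One constraint n1 + 2*n2 = 4: the kernel is generated by (2, -1), so n1 jumps by 2.
print(has_sequential_interval_property(fiber([[1, 2]], [4])))     # False
```

The second example has $n_1 \in \{0, 2, 4\}$, which matches the remark that moves of size 2 in a cell can break the interval property.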


The next result is the most basic connection between the sequential interval 
property and the exponents of a lex basis for the toric ideal. An important point 
is that the condition does not require that all exponents in the Markov basis have 
magnitude at most 1. Rather, it requires that the exponent be at most 1 on the 
indeterminate $x_i$ (square-free in $x_i$) on the moves that involve only the present
and future cells $i, i + 1, \ldots, d$ in the lex basis. This point is important for many
examples, including $3 \times 3 \times 3$ tables with no-3-way interaction (Example 7.4).

With a particular cell order, the indeterminates are typically ordered $x_1 > x_2 > \cdots > x_d$,
and then one can introduce term orders. We primarily use the lexicographic
term order (lex order), which totally orders monomials (or, equivalently,
their vector exponents corresponding to tables) by declaring $x^{\mathbf{n}} > x^{\mathbf{m}}$ if and only
if the first nonzero entry from the left in $\mathbf{n} - \mathbf{m}$ is positive (or $\mathbf{n}$ is after $\mathbf{m}$ in the
dictionary sense). Cox, Little and O’Shea ([10], page 52) explain term orders, including
the grevlex order that we use in Section 5 where the indeterminates are taken in
reverse order $x_d > x_{d-1} > \cdots > x_1$.
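The lex comparison of two exponent vectors can be written in a few lines. The function name `lex_greater` is an assumption made for illustration; for equal-length tuples it agrees with Python's built-in tuple comparison.

```python
def lex_greater(n, m):
    """True if x^n > x^m in lex order with x1 > x2 > ... > xd."""
    for a, b in zip(n, m):
        if a != b:
            return a > b   # the first nonzero entry of n - m decides
    return False           # equal exponent vectors

print(lex_greater((1, 0, 0), (0, 5, 5)))  # True
print(lex_greater((1, 1, 0), (1, 0, 3)))  # True
print(lex_greater((0, 2, 0), (0, 2, 1)))  # False
```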

In the following, we use the term “Gröbner basis,” which is a special generating
set for an ideal ([10], page 74). Lex Gröbner basis (or lex basis) will mean Gröbner
basis with respect to lexicographic term order ([10], page 54) and reduced Gröbner
basis is a unique representation ([10], page 90).


PROPOSITION 3.1. Suppose a Markov basis $M_A = \{\pm\mathbf{m}_1, \ldots, \pm\mathbf{m}_g\}$ has
the property that $G := \{x^{\mathbf{m}_i^+} - x^{\mathbf{m}_i^-}, i = 1, \ldots, g\}$ is a lex Gröbner basis with
ordering $x_1 > x_2 > \cdots > x_d$ on indeterminates and suppose the elements of
$G \cap \mathbb{Q}[x_i, \ldots, x_d]$ are square-free in $x_i$ for each $i$. Then $A^{-1}[\mathbf{t}]$ has the sequential
interval property for all $\mathbf{t}$.


PROOF. By the elimination theorem ([10], page 113), the lex basis $G$ has the
property that $G \cap \mathbb{Q}[x_i, \ldots, x_d]$ is a Gröbner basis for the ideal
$I_{A_i} = \langle x^{\mathbf{m}} - x^{\mathbf{n}} : A_i\mathbf{m} = A_i\mathbf{n} \rangle$. Hence, by Theorem 3.1 of Diaconis and Sturmfels [14], the
difference of the exponents (together with signs $\pm$) of elements in $G \cap \mathbb{Q}[x_i, \ldots, x_d]$ is
a Markov basis with 0 in coordinates $1, 2, \ldots, i - 1$. An application of Lemma 3.1
to the matrix $A_i$ with first coordinate $n_i$ completes the proof. $\square$


When using this result, some orders on the cells may have the square-free prop- 
erty and others may not, so it can be used to find good orderings on the cells. The 
sensitivity to cell ordering shows up in many examples, including logistic regression
and Hardy-Weinberg testing with genotype data (Example 7.5).




In fact, the converse to Proposition 3.1 is also true, in the sense that matrices $A$,
such that $A^{-1}[\mathbf{t}]$ has the sequential interval property regardless of $\mathbf{t}$, are
characterized by their lex Gröbner bases.


PROPOSITION 3.2. Let $A$ be a nonnegative integer matrix such that $A^{-1}[\mathbf{t}]$
has the sequential interval property for all $\mathbf{t}$. Then the reduced lex Gröbner basis $G$
for $I_A$ with ordering $x_1 > x_2 > \cdots > x_d$ has $G \cap \mathbb{Q}[x_i, \ldots, x_d]$ square-free in $x_i$ for
all $i$.


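PROOF. It suffices to prove the claim on the first cell, the rest following by
induction. Let $G := \{x^{\mathbf{m}_i^+} - x^{\mathbf{m}_i^-}\}$ be the reduced lex Gröbner basis. In particular,
none of the monomials $x^{\mathbf{m}_i^+}$ is divisible by the leading monomial of any other
binomial in $I_A$. Suppose there is some $x^{\mathbf{m}^+} - x^{\mathbf{m}^-} \in G$ with $\pi_1(\mathbf{m}^+) = a > 1$. Let
$\mathbf{t} = A\mathbf{m}^+$. Since $A^{-1}[\mathbf{t}]$ has the sequential interval property and $\pi_1(\mathbf{m}^-) = 0$, there
exists $\mathbf{n} \in A^{-1}[\mathbf{t}]$ with $\pi_1(\mathbf{n}) = a - 1$. Then the binomial $x_1^{-a+1}(x^{\mathbf{m}^+} - x^{\mathbf{n}}) \in I_A$
is not equal to $x^{\mathbf{m}^+} - x^{\mathbf{m}^-}$ and has leading term $x_1^{-a+1}x^{\mathbf{m}^+}$, which divides $x^{\mathbf{m}^+}$.
This is a contradiction and $x^{\mathbf{m}^+} - x^{\mathbf{m}^-}$ is not in the reduced Gröbner basis $G$. $\square$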


4. Markov subbases. In this section we give results that can be used when 
the full Markov basis does not have the required properties to guarantee sequen- 
tial intervals. Situations where this occurs include logistic regression [7] and Ex- 
ample 7.3, where the lex bases for the toric ideals do not have the conditions of 
Proposition 3.1. 

The results in this section use the particular values of the margin constraints, 
which may allow a smaller and simpler connecting set that we call a Markov sub- 
basis. An existing method to study connectivity properties of subsets of a Markov 
basis is the primary decomposition ([10], page 208). While useful in some ex- 
amples, it is usually quite difficult to compute. The methods in this section use 
computational tools that are more easily applied in many cases. 

To motivate some of the ideas that follow, recall that in some contingency ta- 
bles it is possible to easily identify a reasonable collection of Markov moves that 
preserve the required constraints and are a basis in the linear algebra sense for the 
kernel of the constraint matrix. However, a basis in the linear algebra sense does 
not always give a Markov basis—the Markov basis allows you to connect all ta- 
bles while remaining nonnegative, a condition not guaranteed by the linear algebra 
basis. The smaller collection, while not a Markov basis, may connect tables with 
certain margin values while remaining nonnegative. The linear algebra basis can 
be enlarged to a Markov basis by a process called saturation discussed below, and 
the result can be much more complicated than the original collection of moves. 

A lex basis for the toric ideal $I_A$ for a constraint matrix $A$ is quite special in
that the Markov moves that involve cells $i, i + 1, \ldots, d$ are a lex basis for the toric
ideal $I_{A_i}$. This is a consequence of the elimination theorem, and means that




one lex basis calculation gives sequential sampling information about all the cells 
in sequence. With a collection of moves smaller than a lex basis for $I_A$, the theory
is more difficult. 

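A Markov subbasis $M_{A,\mathbf{t}}$ for $\mathbf{t} \in \mathbb{Z}_+^r$ and integer matrix $A$ is a finite subset of
$\ker_{\mathbb{Z}}(A)$ such that, for each pair of vectors $\mathbf{u}, \mathbf{v} \in A^{-1}[\mathbf{t}]$, there is a sequence of
vectors $\mathbf{m}_i \in M_{A,\mathbf{t}}$, $i = 1, \ldots, l$, such that


$$\mathbf{u} = \mathbf{v} + \sum_{i=1}^{l} \mathbf{m}_i,$$


$$0 \le \mathbf{v} + \sum_{i=1}^{j} \mathbf{m}_i, \qquad j = 1, \ldots, l.$$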


The connectivity through nonnegative lattice points only is required to hold for 
this specific t. 
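The dependence on the particular $\mathbf{t}$ can be seen on a toy example by a breadth-first search through the fiber. The constraint row $(1, 2, 3)$, the chosen moves and the helper names are assumptions made for this sketch; they do not come from the examples of Section 7.

```python
from collections import deque
from itertools import product

def fiber(a, t):
    """Nonnegative integer solutions of a . n = t for a single positive row a."""
    ranges = [range(t // ai + 1) for ai in a]
    return {n for n in product(*ranges) if sum(ai * x for ai, x in zip(a, n)) == t}

def connects(tables, moves):
    """True if +/- moves link every pair of tables without leaving the fiber."""
    start = min(tables)
    seen, queue = {start}, deque([start])
    while queue:
        n = queue.popleft()
        for m in moves:
            for sign in (1, -1):
                nxt = tuple(x + sign * y for x, y in zip(n, m))
                # Moves are in the kernel, so nxt is in the fiber iff it is nonnegative.
                if nxt in tables and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return seen == tables

a = [1, 2, 3]
moves = [(2, -1, 0), (1, -2, 1)]                    # both lie in the integer kernel of a
print(connects(fiber(a, 2), moves))                 # True
print(connects(fiber(a, 3), moves))                 # False: (0, 0, 1) is isolated
print(connects(fiber(a, 3), moves + [(3, 0, -1)]))  # True
```

In this toy case the two moves act as a Markov subbasis for $t = 2$ but not for $t = 3$, where an extra move is needed to reach the table $(0, 0, 1)$.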


LEMMA 4.1. Suppose a Markov subbasis $M_{A,\mathbf{t}}$ satisfies $\pi_1(M_{A,\mathbf{t}}) \subset \{-1, 0, +1\}$.
Then $\pi_1(A^{-1}[\mathbf{t}])$ is an interval of integers $[l_1, u_1]$.


PROOF. One can connect tables with feasible values $n_1$ and $m_1$ in the first
coordinate by changing the first coordinate only $\pm 1$ at each step, so the gap between
possible values cannot be greater than 1. $\square$


The following proposition is used in Examples 7.3 and 7.4, where Proposi- 
tion 3.1 cannot be used. Recall that a lex basis for a toric ideal has the property 
that each elimination ideal ([10], page 113) is also a lex basis for a remaining toric 
ideal, so applying Lemma 4.1 in sequence is immediate. With a subbasis, how- 
ever, one must add a technical condition involving saturation to get the sequential 
application of Lemma 4.1. 

Saturation (see [28], page 113 or [25], page 215) is an algebraic procedure 
that enlarges an ideal. In our case the ideal will correspond to a collection of 
Markov moves possibly less than a full Markov basis. If $I$ is an ideal in the
ring $\mathbb{Q}[x_1, \ldots, x_d]$ and $f$ is a polynomial, then the saturation of $I$ by $f$
(denoted $I : f^\infty$) is defined by


$$I : f^\infty := \{g \in \mathbb{Q}[x_1, \ldots, x_d] : f^k \cdot g \in I \text{ for some } k > 0\},$$

which is also an ideal. For the indeterminate $x_i$, $I : x_i^\infty$ is the collection of
polynomials $g$ such that $x_i^k g$ is in the ideal $I$ for some choice of the exponent $k$.
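A small worked case may help fix the definition. The ideal below is an assumed toy example in $\mathbb{Q}[x_1, x_2, x_3]$, chosen only because the saturation can be read off by hand.

```latex
% Assumed toy example; not taken from the examples of Section 7.
\[
I = \langle x_1 x_2 - x_1 x_3 \rangle, \qquad
x_2 - x_3 \notin I, \qquad
x_1 (x_2 - x_3) \in I
\;\Longrightarrow\; x_2 - x_3 \in I : x_1^{\infty}.
\]
% Conversely, if x_1^k g lies in I, then (x_2 - x_3) divides x_1^k g and hence g, so
\[
I : x_1^{\infty} = \langle x_2 - x_3 \rangle \supsetneq I,
\qquad\text{while}\qquad
I : x_2^{\infty} = I .
\]
```

Saturating by $x_1$ therefore strictly enlarges this ideal, whereas saturating by $x_2$ leaves it unchanged because $x_2$ shares no factor with the generator.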


PROPOSITION 4.1. Suppose $M_{A,\mathbf{t}}$ is a Markov subbasis, let $M_{A,\mathbf{t}} = \{\pm\mathbf{m}_1, \ldots, \pm\mathbf{m}_g\}$
and let $G := \{x^{\mathbf{m}_i^+} - x^{\mathbf{m}_i^-}, i = 1, \ldots, g\}$. Suppose $G$ has the following
three properties: (1) $G$ is a lex Gröbner basis for the generated ideal $I_{M_{A,\mathbf{t}}}$, with
order $x_1 > x_2 > \cdots > x_d$ on indeterminates; (2) the elements of $G \cap \mathbb{Q}[x_i, \ldots, x_d]$ are
square-free in $x_i$ for each $i$; and (3) $(I_{M_{A,\mathbf{t}}} : x_i^\infty) \cap \mathbb{Q}[x_{i+1}, \ldots, x_d] \subset I_{M_{A,\mathbf{t}}}$ for each
$i = 1, 2, \ldots, d - 1$. Then the polytope $A^{-1}[\mathbf{t}]$ has the sequential interval property.


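PROOF. By Lemma 4.1, $\pi_1(A^{-1}[\mathbf{t}])$ is an interval. We must show that two
tables in $A^{-1}[\mathbf{t}]$ with a common entry in coordinate 1 can be connected with moves
in $M_{A,\mathbf{t}}$ without touching coordinate 1. To see this, suppose tables $\mathbf{u}', \mathbf{v}' \in A^{-1}[\mathbf{t}]$
have common first coordinate $u_1' = v_1' = c$.

Let $\mathbf{u} = (0, u_2, u_3, \ldots, u_d)$, $\mathbf{v} = (0, v_2, v_3, \ldots, v_d)$. We must show that
$x^{\mathbf{u}} - x^{\mathbf{v}} \in \langle G \cap \mathbb{Q}[x_2, \ldots, x_d] \rangle$ to be able to connect them with moves in $G$ that only
involve changing the second coordinate (by only $\pm 1$ at each step). Since $G$ is a lex
basis, $\langle G \cap \mathbb{Q}[x_2, \ldots, x_d] \rangle = I_{M_{A,\mathbf{t}}} \cap \mathbb{Q}[x_2, \ldots, x_d]$, and it is enough to show that
$x^{\mathbf{u}} - x^{\mathbf{v}} \in I_{M_{A,\mathbf{t}}}$. We have that $x^{\mathbf{u}'} - x^{\mathbf{v}'} \in I_{M_{A,\mathbf{t}}}$.

Since $x_1^{c}(x^{\mathbf{u}} - x^{\mathbf{v}}) = x^{\mathbf{u}'} - x^{\mathbf{v}'} \in I_{M_{A,\mathbf{t}}}$, the binomial
$x^{\mathbf{u}} - x^{\mathbf{v}} \in (I_{M_{A,\mathbf{t}}} : x_1^\infty) \cap \mathbb{Q}[x_2, \ldots, x_d]$.
Under the assumption $(I_{M_{A,\mathbf{t}}} : x_1^\infty) \cap \mathbb{Q}[x_2, \ldots, x_d] \subset I_{M_{A,\mathbf{t}}}$, the
first step is proven.

Suppose now that two tables $\mathbf{u}', \mathbf{v}' \in A^{-1}[\mathbf{t}]$ have common first two
coordinates $u_1' = v_1' = c_1$, $u_2' = v_2' = c_2$. Let $\mathbf{u} = (0, 0, u_3, u_4, \ldots, u_d)$,
$\mathbf{v} = (0, 0, v_3, v_4, \ldots, v_d)$. We must show that $x^{\mathbf{u}} - x^{\mathbf{v}} \in \langle G \cap \mathbb{Q}[x_3, \ldots, x_d] \rangle$ to be
able to connect them with moves in $G$ that only involve changing the third
coordinate (by only $\pm 1$ at each step). By the argument above, we have that
$x_2^{c_2}x^{\mathbf{u}} - x_2^{c_2}x^{\mathbf{v}} \in I_{M_{A,\mathbf{t}}}$. Then by the saturation condition on $x_2$,
$x^{\mathbf{u}} - x^{\mathbf{v}} \in (I_{M_{A,\mathbf{t}}} : x_2^\infty) \cap \mathbb{Q}[x_3, \ldots, x_d] \subset I_{M_{A,\mathbf{t}}}$.

The argument continues likewise for each cell in the order $1, 2, \ldots, d$. $\square$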


To use Proposition 4.1, one must have in hand a Markov subbasis, which re- 
quires knowing some connectivity properties. These can be established sometimes 
with ad hoc arguments or with the primary decomposition of the ideal $I_{M_{A,\mathbf{t}}}$.
Lemma 4.2 below is a new method to verify a Markov subbasis, and we use it
in Example 7.3. The quotient “:” operation is defined by $I : f := \{g : f \cdot g \in I\}$, the
result of one step of the saturation procedure defined above.


LEMMA 4.2. Let $M \subset \ker_{\mathbb{Z}}(A)$ be Markov moves with ideal $I_M$. Suppose
each element $\mathbf{n} \in A^{-1}[\mathbf{t}]$ satisfies $n_s > 0$ for all $s \in S \subset \{1, \ldots, d\}$, and
suppose that $(I_M : \prod_{s \in S} x_s) = I_A$, the toric ideal. Then the moves in $M$ connect all
of $A^{-1}[\mathbf{t}]$ and are therefore a Markov subbasis.


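PROOF. Let $\mathbf{u}, \mathbf{v} \in A^{-1}[\mathbf{t}]$, and let $\mathbf{u}' = \mathbf{u} - \mathbf{1}_S$, $\mathbf{v}' = \mathbf{v} - \mathbf{1}_S$, where $\mathbf{1}_S$ is the
vector with 1 in the coordinates that are in the set $S$, and 0 elsewhere. Clearly,
$x^{\mathbf{u}'} - x^{\mathbf{v}'} \in I_A$, so by the saturation assumption $(x^{\mathbf{u}'} - x^{\mathbf{v}'}) \prod_{s \in S} x_s \in I_M$. The
fundamental result of Diaconis and Sturmfels ([14], Theorem 3.1) says that the
moves in $M$ connect $\mathbf{u} = \mathbf{u}' + \mathbf{1}_S$ with $\mathbf{v} = \mathbf{v}' + \mathbf{1}_S$ through the nonnegative tables. $\square$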




5. Bounds on cell entries. When the conditions for sequential interval property
are met, the next question is how to quickly determine or approximate the upper
and lower bounds of the interval $[l, u]$. In very special cases one can use known
formulas for the interval, such as the Fréchet bounds. This works in two-way tables
and some decomposable graphical models [6]. For general multiway tables,
usually no simple formula is available to compute the bounds. Three general ways
to determine or approximate the upper and lower bounds of the interval $[l, u]$ are
integer programming (IP), linear programming (LP) and the shuttle algorithm. IP
always gives the exact integer bounds $l$ and $u$, but it is much slower than the other
two methods.

LP in the rational numbers can dynamically find bounds on the interval at each 
step in the sampling. LP is much faster than IP, and under conditions that hold in 
many examples, LP gives the same answer as IP. The conditions we formulate are 
concrete algebraic conditions that can be checked with a preliminary calculation. 
Hosten and Sturmfels [22] study the difference between LP and IP from a different 
point of view. They give the largest possible difference over all constraint values, 
whereas our results use the particular constraint values of the data set. 

The numerical implementation of LP to determine an interval $[l, u]$ must be
done carefully. LP sometimes gives wider intervals than the true interval because
LP considers solutions in a larger space. Roundoff of numerical approximations
that come from floating point operations or interior point methods can result in
sampling a number out of the feasible range $[l, u]$ or into a strict subset of the
feasible range which can lead to errors. The program that we embedded into the
sampling code and that worked well is lpSolve [1].
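One common safeguard against the roundoff problem just described is to round the floating point LP bounds with a small tolerance. The sketch below is an assumption about how this could be done, not a description of the authors' code; the function name and the tolerance `1e-7` are hypothetical choices.

```python
import math

def integer_interval(lp_lower, lp_upper, tol=1e-7):
    """Round LP bounds to integers, forgiving roundoff error of size up to tol."""
    lower = math.ceil(lp_lower - tol)    # 2.0000000003 is treated as 2, not rounded up to 3
    upper = math.floor(lp_upper + tol)   # 4.9999999996 is treated as 5, not rounded down to 4
    if lower > upper:
        raise ValueError("empty interval after rounding")
    return lower, upper

print(integer_interval(2.0000000003, 4.9999999996))  # (2, 5)
print(integer_interval(0.5, 3.5))                    # (1, 3)
```

Without the tolerance, `math.ceil(2.0000000003)` would return 3 and silently drop the feasible value 2, which is the "strict subset" failure mentioned above.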

A third way to approximate the intervals is the shuttle algorithm, described in [5] 
and [16]. This is an iterative method that usually does not give exact IP results, but 
it has two advantages in special cases: it is fast and easy to program, and it can 
be implemented without explicitly constructing a constraint matrix, a task which 
may be impossible for very large problems with millions of cells. In our numerical 
examples LP works better than the shuttle algorithm, in some cases much better. 

Consider the IP and LP problems 


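$$\begin{aligned}
u_i(\mathbf{b}) &:= \max\{n_i : A_i\mathbf{n} = \mathbf{b},\ \mathbf{n} \in \mathbb{Z}_+^d\},\\
l_i(\mathbf{b}) &:= \min\{n_i : A_i\mathbf{n} = \mathbf{b},\ \mathbf{n} \in \mathbb{Z}_+^d\},\\
U_i(\mathbf{b}) &:= \max\{q_i : A_i\mathbf{q} = \mathbf{b},\ \mathbf{q} \in \mathbb{Q}_+^d\},\\
L_i(\mathbf{b}) &:= \min\{q_i : A_i\mathbf{q} = \mathbf{b},\ \mathbf{q} \in \mathbb{Q}_+^d\},
\end{aligned}$$


where $\mathbb{Z}_+^d$, $\mathbb{Q}_+^d$ are the nonnegative integers and nonnegative rational numbers,
respectively. We are interested in bounding the nonnegative quantities $U_i - u_i$
and $l_i - L_i$.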
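The gap between the LP and IP bounds is easy to see when there is a single constraint row. The sketch below is a toy illustration under that assumption: with one row of positive coefficients and at least two cells, the rational optimum puts all mass on one coordinate, so no LP solver is needed. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def ip_bounds_first(a, b):
    """Exact min and max of n_1 over nonnegative integer n with a . n = b (brute force)."""
    ranges = [range(b // ai + 1) for ai in a]
    firsts = [n[0] for n in product(*ranges) if sum(ai * x for ai, x in zip(a, n)) == b]
    return min(firsts), max(firsts)

def lp_bounds_first(a, b):
    """Min and max of q_1 over nonnegative rational q with a . q = b, one positive row a."""
    # Assumption: len(a) >= 2 and all coefficients positive, so q_1 ranges over [0, b / a_1].
    return Fraction(0), Fraction(b, a[0])

a, b = [2, 3], 7
print(ip_bounds_first(a, b))   # (2, 2): the only table is (2, 1)
print(lp_bounds_first(a, b))   # (Fraction(0, 1), Fraction(7, 2))

a, b = [1, 1], 7
print(ip_bounds_first(a, b))   # (0, 7)
print(lp_bounds_first(a, b))   # (Fraction(0, 1), Fraction(7, 1))
```

In the first case $U_1 - u_1 = 3/2$ and $l_1 - L_1 = 2$, while in the second case, where all coefficients equal 1, the LP and IP bounds coincide.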

In Propositions 5.1 and 5.2 that follow, we use the relationship between lower 
and upper IP bounds and normal forms with respect to lex and grevlex term orders 




explained in [9] and stated in Algorithm 5.6 of [28], page 43. For the following
proposition, let $A_{\mathbb{Q}}^{-1}[\mathbf{t}] = \{\mathbf{q} \in \mathbb{Q}_+^d : A\mathbf{q} = \mathbf{t}\}$, the set of nonnegative rational
vectors with constraints $\mathbf{t}$.


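PROPOSITION 5.1. Suppose a Markov subbasis $M_{A,\mathbf{t}} = \{\pm\mathbf{m}_1, \ldots, \pm\mathbf{m}_g\}$
has the property that $G := \{x^{\mathbf{m}_i^+} - x^{\mathbf{m}_i^-}, i = 1, \ldots, g\}$ is a lex Gröbner basis
with ordering $x_1 > x_2 > \cdots > x_d$ on indeterminates for the generated ideal $I_{M_{A,\mathbf{t}}}$.
Also, suppose $I_{M_{A,\mathbf{t}}} : \prod_{s \in S_{\mathbb{Q}}} x_s = I_A$, where $S_{\mathbb{Q}}$ is the collection of coordinates
which are always positive for elements in $A_{\mathbb{Q}}^{-1}[\mathbf{t}]$, and suppose
$(I_{M_{A,\mathbf{t}}} : x_i^\infty) \cap \mathbb{Q}[x_{i+1}, \ldots, x_d] \subset I_{M_{A,\mathbf{t}}}$ for each $i = 1, 2, \ldots, d - 1$.

If the coordinate values of all $\mathbf{m}_i^+$ ($i = 1, \ldots, g$) are in $\{0, 1\}$, then $l_j(\mathbf{t}_j) = L_j(\mathbf{t}_j)$
for all $j = 1, 2, \ldots, d$ and all $\mathbf{t}_j$ given by $\mathbf{t}_1 = \mathbf{t}$,
$\mathbf{t}_j = \mathbf{t} - \mathbf{a}_1 n_1 - \mathbf{a}_2 n_2 - \cdots - \mathbf{a}_{j-1} n_{j-1}$, $j = 2, \ldots, d$.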


PROOF. We show first the result that $l_1 \le L_1$. Let $\mathbf{m} \in A^{-1}[\mathbf{t}]$.
Use long division to compute the normal form of $x^{\mathbf{m}}$ with respect to $I_{M_{A,\mathbf{t}}}$. Let
the normal form be the monomial $x^{\mathbf{n}^*}$. It is nearly immediate that $n_1^* \ge l_1$, since
the first coordinate of the normal form when dividing by a Gröbner basis for the
full ideal $I_A$ is $l_1$.

Let $\mathbf{q}^*$ solve $L_1 = \min\{q_1 : A\mathbf{q} = \mathbf{t}, \mathbf{q} \in \mathbb{Q}_+^d\}$. We show that $q_1^* \ge n_1^*$, which
together with $n_1^* \ge l_1$ will prove the result $L_1 = l_1$.

Suppose by way of contradiction that $n_1^* > q_1^*$. Since $\mathbf{q}^*$ is rational, an integer
multiple, say $\lambda\mathbf{q}^*$, is integral. Then $A(\lambda\mathbf{q}^*) = A(\lambda\mathbf{n}^*)$, so $x^{\lambda\mathbf{n}^*} - x^{\lambda\mathbf{q}^*} \in I_A$.
Furthermore, by the assumption of positivity of coordinates $S_{\mathbb{Q}}$ on elements in $A_{\mathbb{Q}}^{-1}[\mathbf{t}]$,
it follows that $q_s^*, n_s^* > 0$ for $s \in S_{\mathbb{Q}}$. Then $x^{\lambda\mathbf{n}^*} - x^{\lambda\mathbf{q}^*} \in I_{M_{A,\mathbf{t}}}$ by the assumption
$I_{M_{A,\mathbf{t}}} : \prod_{s \in S_{\mathbb{Q}}} x_s = I_A$.

Since $G$ is a Gröbner basis for this ideal, one of the lead terms of the basis must
divide the lead monomial $x^{\lambda\mathbf{n}^*}$. This means that the indices of positive coordinates
of the exponents $\mathbf{m}^+$ of the lead monomial must be included in the positive
coordinates of $\mathbf{n}^*$. Since the corresponding coordinate values are 0 or 1, the divisor
must also divide $x^{\mathbf{n}^*}$. This contradicts its construction above as the normal
form without divisors. Hence, it cannot be the case that $n_1^* > q_1^*$. This proves that
$l_1 \le n_1^* \le q_1^* = L_1$.

We show next the result that $l_2 \le L_2$. Let $\mathbf{m} \in A^{-1}[\mathbf{t}]$.

Use long division to compute the normal form of $x^{\mathbf{m}}$ with respect to
$G_2 := G \cap \mathbb{Q}[x_2, x_3, \ldots, x_d]$, the elements of the subbasis that only involve
coordinates $2, 3, \ldots, d$. Let the normal form be the monomial $x^{\mathbf{n}^*}$, where $n_1^* = m_1$, which
has not changed in the division. It is nearly immediate that $n_2^* \ge l_2$, since the first
coordinate of the normal form when dividing $x_2^{m_2} \cdots x_d^{m_d}$ by a Gröbner basis for
the full ideal $I_{A_2}$ is $l_2$.

Let $\mathbf{q}^*$ solve $L_2 = \min\{q_2 : A\mathbf{q} = \mathbf{t}, q_1 = m_1, \mathbf{q} \in \mathbb{Q}_+^d\}$. We show that $q_2^* \ge n_2^*$,
which together with $n_2^* \ge l_2$ will prove the result $L_2 = l_2$.




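Suppose by way of contradiction that $n_2^* > q_2^*$. Since $\mathbf{q}^*$ is rational, an integer
multiple, say $\lambda\mathbf{q}^*$, is integral. Then $A(\lambda\mathbf{q}^*) = A(\lambda\mathbf{n}^*)$, so $x^{\lambda\mathbf{n}^*} - x^{\lambda\mathbf{q}^*} \in I_A$.
Furthermore, by the assumption of positivity of coordinates $S_{\mathbb{Q}}$ on elements
in $A_{\mathbb{Q}}^{-1}[\mathbf{t}]$, it follows that $q_s^*, n_s^* > 0$ for $s \in S_{\mathbb{Q}}$. Then $x^{\lambda\mathbf{n}^*} - x^{\lambda\mathbf{q}^*} \in I_{M_{A,\mathbf{t}}}$ and,
since $q_1^* = n_1^* = m_1$, the saturation condition on $x_1$ gives
$x_2^{\lambda n_2^*} \cdots x_d^{\lambda n_d^*} - x_2^{\lambda q_2^*} \cdots x_d^{\lambda q_d^*} \in I_{M_{A,\mathbf{t}}}$.

Since $G_2$ is a lex Gröbner basis for the ideal $I_{M_{A,\mathbf{t}}} \cap \mathbb{Q}[x_2, \ldots, x_d]$ by the
elimination theorem, one of the lead terms of the basis $G_2$ must divide the lead
monomial $x_2^{\lambda n_2^*} \cdots x_d^{\lambda n_d^*}$, since we have just shown that this is the lead monomial in a
binomial that belongs to $I_{M_{A,\mathbf{t}}} \cap \mathbb{Q}[x_2, \ldots, x_d]$. This means that the indices of
positive coordinates of the exponents $\mathbf{m}^+$ of the lead monomial must be included
in the positive coordinates of $\mathbf{n}^*$. Since the corresponding coordinate values are
0 or 1, the divisor must also divide $x^{\mathbf{n}^*}$. This contradicts its construction above as
the normal form without divisors. Hence, it cannot be the case that $n_2^* > q_2^*$. This
proves that $l_2 \le n_2^* \le q_2^* = L_2$.

The remaining coordinates are proved similarly. $\square$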


There is a corresponding result for the upper bounds. Whereas the lex basis re- 
lates IP minimization to the normal form of a monomial, it is the grevlex basis that 
relates IP maximization to the normal form. We state the result below only for the 
first entry, since it must be applied repeatedly. Using the result requires recomputing
a grevlex basis for each of the matrices $A_i$ (containing columns $i, i + 1, \ldots, d$
from A) and rechecking the condition, because we cannot simply apply an elimi- 
nation theorem on a single lex basis as before. 


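PROPOSITION 5.2. Suppose a Markov subbasis $M_{A,\mathbf{t}} = \{\pm\mathbf{m}_1, \ldots, \pm\mathbf{m}_g\}$
has the property that $G := \{x^{\mathbf{m}_i^+} - x^{\mathbf{m}_i^-}, i = 1, \ldots, g\}$ is a grevlex Gröbner basis
with ordering $x_d > x_{d-1} > \cdots > x_1$ on indeterminates for the generated
ideal $I_{M_{A,\mathbf{t}}}$. Also, suppose $I_{M_{A,\mathbf{t}}} : \prod_{s \in S_{\mathbb{Q}}} x_s = I_A$, where $S_{\mathbb{Q}}$ is the collection of
coordinates which are always positive for elements in $A_{\mathbb{Q}}^{-1}[\mathbf{t}]$. If the coordinate
values of $\mathbf{m}_i^+$ are in $\{0, 1\}$, then $u_1(\mathbf{t}) = U_1(\mathbf{t})$.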


PROOF. We show that $U_1 \le u_1$. Let $\mathbf{m} \in A^{-1}[\mathbf{t}]$.

Use long division to compute the normal form of $x^{\mathbf{m}}$ with respect to the grevlex
basis of $I_{M_{A,\mathbf{t}}}$. Let the normal form be the monomial $x^{\mathbf{n}^*}$. It is nearly immediate that
$n_1^* \le u_1$, since the exponent on $x_1$ of the normal form when dividing by a grevlex
Gröbner basis with reversed indeterminate order for the full ideal $I_A$ is $u_1$.

Let $\mathbf{q}^*$ solve $U_1 = \max\{q_1 : A\mathbf{q} = \mathbf{t}, \mathbf{q} \in \mathbb{Q}_+^d\}$. We show that $q_1^* \le n_1^*$, which
together with $n_1^* \le u_1$ will prove the result $U_1 \le u_1$.




Suppose by way of contradiction that $q_1^* > n_1^*$. Since $\mathbf{q}^*$ is rational, an integer
multiple, say $\lambda\mathbf{q}^*$, is integral. Then $A(\lambda\mathbf{q}^*) = A(\lambda\mathbf{n}^*)$, so $x^{\lambda\mathbf{n}^*} - x^{\lambda\mathbf{q}^*} \in I_A$.
Furthermore, by the assumption of positivity of coordinates $S_{\mathbb{Q}}$ on elements in $A_{\mathbb{Q}}^{-1}[\mathbf{t}]$,
it follows that $q_s^*, n_s^* > 0$ for $s \in S_{\mathbb{Q}}$. Then $x^{\lambda\mathbf{n}^*} - x^{\lambda\mathbf{q}^*} \in I_{M_{A,\mathbf{t}}}$ by the assumption
$I_{M_{A,\mathbf{t}}} : \prod_{s \in S_{\mathbb{Q}}} x_s = I_A$.

Since $G$ is a Gröbner basis for this ideal, one of the lead terms of the basis must
divide the lead monomial $x^{\lambda\mathbf{n}^*}$. This means that the indices of positive coordinates
of the exponents $\mathbf{m}^+$ of the lead monomial must be included in the positive
coordinates of $\mathbf{n}^*$. Since the corresponding coordinate values are 0 or 1, the divisor
must also divide $x^{\mathbf{n}^*}$. This contradicts its construction above as the normal
form without divisors. Hence it cannot be the case that $q_1^* > n_1^*$. This proves that
$U_1 = q_1^* \le n_1^* \le u_1$. $\square$


The corollary below, combining Propositions 5.1 and 5.2, applies directly to 
Examples 7.1, 7.2 and 7.4. 


COROLLARY 5.1. If a lex Gröbner basis for $I_A$ has square-free exponents on
the lead monomials, then $l_j = L_j$ for all $j = 1, \ldots, d$. If each grevlex Gröbner
basis for $I_{A_j}$, $j = 1, \ldots, d$, and indeterminate ordering $x_d > x_{d-1} > \cdots > x_j$ has
square-free exponents on the lead monomials, then $u_j = U_j$ for all $j = 1, \ldots, d$.


PROOF. The assumptions of Proposition 5.1 hold if $I_{M_{A,\mathbf{t}}} = I_A$, so the lower
bounds from LP and IP are equal. For the upper bounds, the statement is a restatement
of Proposition 5.2 for each step in the sequential sampling. $\square$


6. Sampling distributions. Assume that the sequential interval property 
holds for a multiway table with given constraints, and that the intervals can be 
approximated by LP. The next question is how to sample from these intervals. 
Ideally, we want to sample a cell value from the true marginal distribution of a cell 
entry conditional on the entries that have already been sampled. However, these 
marginal distributions are quite difficult to compute explicitly except in very small 
examples. SIS samples from a simple proposal distribution (rather than the true 
distribution) on the set of all possible marginal values. 

For a target uniform distribution, which is useful for counting the total number
of tables and some Bayesian applications, we propose a uniform distribution on
the available interval for each cell, that is, $p(x) = 1/(u - l + 1)$ on integers in the
interval $[l, u]$. We call this the “uniform sampling method.” With the length of the
proposed sampling interval, the importance weights can be computed exactly for
reweighting at the end. This strategy gives low $\mathrm{cv}^2$ ($<5$ for all examples we have
tested) and works very well on the examples in Section 7.
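The uniform sampling method can be sketched on a toy table as follows. This is an illustration under stated assumptions rather than the authors' R implementation: the intervals are found by brute-force enumeration instead of LP, and the function names, the seed and the $2 \times 3$ example are hypothetical.

```python
import random
from itertools import product

def fiber(A, t):
    """All nonnegative integer tables n with A n = t, by brute force (tiny examples only)."""
    d = len(A[0])
    return [n for n in product(range(max(t) + 1), repeat=d)
            if all(sum(a * x for a, x in zip(row, n)) == ti for row, ti in zip(A, t))]

def sis_uniform(tables, rng):
    """Sample cells in order, each uniformly on its interval; return (table, weight 1/q)."""
    candidates, weight = tables, 1
    for i in range(len(tables[0])):
        values = [n[i] for n in candidates]
        low, high = min(values), max(values)
        x = rng.randint(low, high)      # uniform proposal on the integers in [low, high]
        weight *= high - low + 1        # reciprocal of p(x) = 1 / (u - l + 1)
        candidates = [n for n in candidates if n[i] == x]
        if not candidates:              # x fell in a gap of the support: invalid draw
            return None, 0
    return candidates[0], weight

# 2 x 3 table, cells (n11, n12, n13, n21, n22, n23), row sums (3, 2), column sums (2, 2, 1).
A = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1],
]
t = [3, 2, 2, 2, 1]
tables = fiber(A, t)

rng = random.Random(2006)
weights = [sis_uniform(tables, rng)[1] for _ in range(1000)]
estimate = sum(weights) / len(weights)      # estimates the number of tables
variance = sum((w - estimate) ** 2 for w in weights) / len(weights)
cv2 = variance / estimate ** 2
print(len(tables), round(estimate, 2), round(cv2, 3))
```

Here the true count is 5 and each draw has weight 3 or 6, so the average weight should land close to 5 with a small $\mathrm{cv}^2$; in realistic problems the enumeration step would be replaced by the LP bounds of Section 5.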

For a target hypergeometric distribution, which arises in conditional inference 
with multinomial sampling, we propose to sample a cell value from the hyper- 
geometric distribution p(x) = (f) Qayu Ge) on the interval of available inte- 
gers $[l, u]$. We call this the “hypergeometric sampling method,” which is usually




(but not always, see Example 7.4) better than the uniform sampling method when 
the target distribution is hypergeometric. This hypergeometric proposal does not 
give the exact hypergeometric target in the end. It is just a reasonable marginal 
approximation. This method gives satisfactory results for examples in Section 7, 
although the $\mathrm{cv}^2$ is not consistently small. For sparse tables, approximating the
marginal mass function of the count in a single cell can be difficult. 


7. Examples. In the examples that follow, we sample sequentially from in- 
tervals computed with the LP approximation. The LP approximation is very close 
to or exactly equal to the IP range in all examples. In Example 7.1 one can ap- 
ply known results on Markov bases to avoid algebraic calculations, and the most 
basic results of Section 3 apply. In Example 7.2 one must do explicit algebraic 
calculations to verify the conditions of Section 3. We did a detailed numerical 
comparison with the Markov chain on Example 7.2. Example 7.3 (6-way Czech 
autoworker data) is one that requires the full theory of Markov subbases of Sec- 
tion 4 and consideration of the specific margin values to get sequential intervals 
under one model. We also study a second model for which we could not compute 
the Markov basis, and we see that SIS still works well. The no-3-way interaction 
model of Example 7.4 is a well-known example where the Gröbner basis involves
moves of size 2, and yet the sequential theory applies perfectly. Example 7.5 is 
a classic triangular genotype table, and it brings out the importance of checking 
different cell orders. In some orders the sequential interval property holds, and in 
other quite natural orders it does not, and this can be seen in the lex basis. Finally, 
Example 7.6 is an important application of sampling on lattice points that are not 
strictly speaking contingency tables. The work of Rapallo [27] on Markov bases 
and structural zeros may be useful for other examples. 

The starting point to verify the conditions of Sections 3, 4 and 5 for a particular 
example is to attempt to compute the toric ideal $I_A$. For this we have used the
toric library toric.lib in the free software Singular [19] and the groebner 
command in 4ti2 [21]. The software 4ti2 was used to construct constraint matrices 
for several examples. The operations of saturation and quotient (“:”) that figure in 
the results of Sections 4 and 5 were done quickly in Singular. 

In the following examples, all results are based on 1000 random samples using 
either the uniform sampling method or the hypergeometric sampling method. The 
code was written in R [26] and the software lpSolve was called from R. The running
times range from several seconds to a few minutes on a 2.0 GHz computer.
When IP is used instead of LP, a computation typically takes hours, and sometimes
it will not terminate in a reasonable amount of time.


EXAMPLE 7.1. Consider the 3-way case/control data (Table 1) in the $4 \times 4 \times 2$
table from the Ille-et-Vilaine cancer study of the age 35-44 group ([3], Appendix I).
The factors are Alcohol level (A), Tobacco level (T) and Response R, where
R = 0 is a control measurement and R = 1 is a case.




TABLE 1
Age 35-44 data on oesophageal cancer from [3]

                        A
                1     2     3     4

R = 0   T  1   60    35    11     1
           2   13    20     6     3
           3    7    13     2     2
           4    8     8     1     0

R = 1   T  1    0     0     0     2
           2    1     3     0     0
           3    0     1     0     2
           4    0     0     0     0


The "case" outcomes are sampled with a multinomial distribution with proba- 
bilities p(a, t|1) on the Alcohol and Tobacco covariates. The "control" outcomes 
are also sampled with a multinomial distribution with probabilities p(a, t|0). With 
a retrospective model $p(a, t \mid 1)/p(a, t \mid 0) = e^{\alpha_a + \beta_t}$ of eight parameters, the
appropriate margins to fix for conditional inference [treating $p(a, t \mid 0)$ as unknown
nuisance parameters] are [A, T] (sum over case/control counts at each level), 
[A, R] and [T, R] (sums over other factor at each response level). The constraints 
imply that the Graver basis ([28], page 55) for the independence model on T and R 
is a Markov basis, and the Graver basis is equivalent to the collection of square-free 
circuit moves on one level of the Response factor. Thus, the results of Section 3 
and Corollary 5.1 imply the property of sequential intervals and LP will give the 
exact integral interval bounds at each step. 
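The three sets of fixed margins can be computed directly from the counts in Table 1. The sketch below transcribes the table as read here; the array layout `table[r][t][a]` and the variable names are assumptions made for illustration.

```python
# Counts from Table 1, indexed as [tobacco level][alcohol level].
controls = [[60, 35, 11, 1], [13, 20, 6, 3], [7, 13, 2, 2], [8, 8, 1, 0]]   # R = 0
cases = [[0, 0, 0, 2], [1, 3, 0, 0], [0, 1, 0, 2], [0, 0, 0, 0]]            # R = 1
table = [controls, cases]   # table[r][t][a]

# [A, T] margin: sum over response at each (tobacco, alcohol) level.
margin_AT = [[controls[t][a] + cases[t][a] for a in range(4)] for t in range(4)]
# [A, R] margin: sum over tobacco at each (response, alcohol) level.
margin_AR = [[sum(table[r][t][a] for t in range(4)) for a in range(4)] for r in range(2)]
# [T, R] margin: sum over alcohol at each (response, tobacco) level.
margin_TR = [[sum(table[r][t]) for t in range(4)] for r in range(2)]

print(margin_AR)  # [[88, 76, 20, 6], [1, 4, 0, 4]]
print(margin_TR)  # [[107, 42, 24, 17], [2, 4, 3, 0]]
```

Any table produced by the sampler must reproduce all three margins, which gives a simple check on sampled output.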

The simulation with LP gave 100% good tables. When the underlying distribution
is uniform, the uniform sampling method gave cv² of 0.24 and estimated the
total number of tables to be 25, a number confirmed by LattE in a total elapsed time
of 7 seconds on a 2.8 GHz desktop. When the underlying distribution is
hypergeometric, the hypergeometric sampling method gave cv² of 0.5, and the estimated
p-value for the exact goodness-of-fit test [defined by equations (1) and (2)] is 0.04.
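The two diagnostics quoted in these examples, cv² and the estimated number of tables, are both functions of the importance weights. The following is a minimal sketch in Python under assumed standard definitions: for a uniform target the weight of a sampled table T is 1/q(T), the mean weight estimates the number of tables, and cv² is the variance of the weights divided by their squared mean. The function names and the toy proposal probabilities are illustrative and are not taken from the authors' R code.

```python
from statistics import mean, pvariance

def importance_weights(proposal_probs):
    """Weights 1/q(T) for a uniform target over the valid tables (assumed form)."""
    return [1.0 / q for q in proposal_probs]

def estimate_count(weights):
    """The mean weight estimates the total number of tables."""
    return mean(weights)

def cv_squared(weights):
    """Squared coefficient of variation of the weights (population variance)."""
    m = mean(weights)
    return pvariance(weights) / (m * m)

# Made-up proposal probabilities q(T) for four sampled tables.
q = [0.05, 0.04, 0.05, 0.04]
w = importance_weights(q)
print(estimate_count(w), cv_squared(w))
```

A small cv² indicates that the proposal is close to the target, which is why the uniform examples with cv² below 3 are reported as working well.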


EXAMPLE 7.2. Consider the 4-way abortion opinion data (Table 2) from [8], 
page 129. The observations are classified according to race, sex, age and opinion. 
There are three different opinions: yes means supporting legalized abortion, no 
means opposing legalized abortion, and the last one is undecided. 

Christensen fits the log-linear model for the expected cell counts with all three- 
way interactions and all lower order terms. A shorthand notation for this model 
is to list its highest-order interaction terms: [RSO], [RSA], [ROA] and [SOA]. 
The conditional goodness-of-fit test for this model requires fixing all 3-way mar- 
gins, [R, S, O], [R, S, A], [R, O, A] and [S, O, A]. The lex basis of 165 elements 


SIS FOR MULTIWAY TABLES 539 


TABLE 2
4-way abortion opinion data from [8]

                                            Age
Race      Sex     Opinion   18-25  26-35  36-45  46-55  56-65  66+

White     Male    Yes       96     138    117    75     72     83
                  No        44     64     56     48     49     60
                  Undec.    1      2      6      5      6      8
          Female  Yes       140    171    152    101    102    111
                  No        43     65     58     51     58     67
                  Undec.    1      4      9      9      10     16
Nonwhite  Male    Yes       24     18     16     12     6      4
                  No        5      7      7      6      8      10
                  Undec.    2      1      3      4      3      4
          Female  Yes       21     25     20     17     14     13
                  No        4      6      5      5      5      5
                  Undec.    1      2      1      1      1      1


is square-free in the lead monomials, so the sequential interval property holds by
Section 3 and the IP and LP lower bounds are identical. A more detailed
calculation to verify the conditions of Corollary 5.1 requires computing a grevlex
basis for each of the submatrices $A_i$, defined in Section 3 as the matrix that has
columns $i, i+1, \ldots, d$ from $A$. This can be done and the condition is verified,
proving that LP and IP upper bounds are always the same.

The LP method for finding the interval bounds gave 100% good tables in
practice. When the underlying distribution is uniform, the uniform sampling method
gave cv² of 2.92 and estimated the total number of tables to be 9.1 x 10^7. When
the underlying distribution is hypergeometric, the value of cv² using the
hypergeometric sampling method was around 102.9, and the estimated p-value for the
exact goodness-of-fit test [defined by (1) and (2)] is 0.85 with standard error 0.1,
based on 1000 tables which took about 5 minutes in R on a 2.0 GHz computer. The
MCMC algorithm generated 1000 samples (with 1,000,000 samples as burn-in) in
224 minutes and estimated the p-value to be 0.84 with standard error 0.05. Thus,
SIS is about 11 times faster than the MCMC algorithm for this example.
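One plausible reading of the factor of 11 is a comparison at equal standard error. The sketch below assumes that the Monte Carlo standard error shrinks like one over the square root of the number of samples, so halving the SIS standard error from 0.1 to 0.05 costs four times the run time. This accounting is an assumption made for illustration; the text does not spell out how the factor was obtained.

```python
def time_to_match(se_now, minutes_now, se_target):
    """Run time needed to reach se_target, assuming SE is proportional to
    1/sqrt(n) and run time is proportional to n."""
    return minutes_now * (se_now / se_target) ** 2

# SIS: SE 0.1 in about 5 minutes; MCMC: SE 0.05 in 224 minutes (from the text).
sis_minutes = time_to_match(se_now=0.1, minutes_now=5, se_target=0.05)
speedup = 224 / sis_minutes
print(sis_minutes, round(speedup, 1))
```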

The algebraic conditions for SIS with some models on this data are difficult to
verify. For example, 4ti2 runs for an hour on a 2.8 GHz Linux desktop with 1 GB
of memory without completing the Markov basis calculation on the model [RS],
[RA], [RO], [SO], [SA], [OA].


EXAMPLE 7.3. Consider the 6-way binary Czech autoworker data in Table 3 
from a prospective study of probable risk factors for coronary thrombosis [18]. 
There are 1,841 men in a car factory involved in the study. Here A, B, C, D, E
and F indicate different risk factors. One reasonable model is given by [ACDEF],
[ABDEF], [ABCDE], [BCDF], [ABCF], [BCEF] [17]. The conditional goodness- 
of-fit test for this model requires fixing the three 5-way and the three 4-way 




TABLE 3
6-way Czech autoworker data from [18]

                                 B:   no                yes
F         E     D      C         A:   no       yes      no       yes

Negative  <3    <140   no             44       40       112      67
                       yes            129      145      12       23
                ≥140   no             35       12       80       33
                       yes            109      67       7        9
          ≥3    <140   no             23       32       70       66
                       yes            50       80       (0) 7    13
                ≥140   no             24       25       73       57
                       yes            51       63       7        16

Positive  <3    <140   no             5        7        21       9
                       yes            (0) 9    17       (0) 1    (0) 4
                ≥140   no             (0) 4    3        11       8
                       yes            14       17       5        (0) 2
          ≥3    <140   no             7        (0) 3    14       14
                       yes            9        16       (0) 2    (0) 3
                ≥140   no             (0) 4    (0) 0    13       11
                       yes            (0) 5    14       (0) 4    4


margins in the above model representation. Implementing SIS for this example 
requires techniques beyond the basic methods of Section 3, because the lex basis 
does not have square-free lead exponents. 

In Table 3 (0) indicates that the LP lower bound for that cell entry is 0 with the 
constraints from the model above; the others are strictly positive. Identifying these 
cells is relevant when we apply Propositions 4.1, 5.1 and 5.2, as the (0) cells form 
the complement of the set $S_0$ (defined in Proposition 5.1).

The lex basis for the toric ideal with lex order in indeterminates yields 20
elements, the first of which has an exponent of 2 on the lead indeterminate
$x_{111111}$. Therefore, Proposition 3.1 cannot be applied directly. However, the
ideal generated by the other 19 polynomials saturates in one step with respect to the
monomial $\prod_{s \in S} x_s$, where S is the set of 41 coordinates that must be
positive. Hence, by Lemma 4.2 these 19 moves are a Markov subbasis. They are a lex
Gröbner basis for themselves, and they have the saturation property required in
Proposition 4.1, so the sequential interval property holds. Furthermore,
Proposition 5.1 shows that the IP and LP lower bounds are always the same [which also
implies that the (0) cells in the rationals are the same cells as those that could be
0 in the integers]. Corollary 5.1 does not apply to show that the LP and IP upper
bounds are the same, because exponents of 2 appear in the grevlex bases. We can use
Proposition 5.2 on successive cells to show that LP and IP are the same after a few
initial cells.




If the cells are filled in across rows and then down, the order is 111111, 211111, 
121111,... (the order from 4ti2). The sequential interval property holds in this
order as well. 

For this model, using LP for interval bounds gave 100% good tables. The shuttle
algorithm gave 99% good tables with one iteration and 99.5% with two
iterations. When the underlying distribution is uniform, the uniform sampling method
gave cv² of 1.09 and estimated the total number of tables to be 841. The quantity
cv² when targeting the hypergeometric distribution using the hypergeometric
sampling method was 50.7, and the estimated p-value for the exact goodness-of-fit
test [defined by (1) and (2)] is 0.27. Fitting this model in R using the loglin
command gives a χ² statistic of 5.8 on 4 degrees of freedom, for a p-value of
approximately 0.21.
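The asymptotic p-value quoted above can be reproduced by hand. The sketch below uses the closed form of the chi-square upper tail for an even number of degrees of freedom, P(X > x) = exp(-x/2) * sum over k < df/2 of (x/2)^k / k!. The function name is illustrative; it is not the R `loglin` routine.

```python
from math import exp, factorial

def chi2_sf_even_df(x, df):
    """Upper tail P(X > x) of a chi-square variable with a positive even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))

# Statistic 5.8 on 4 degrees of freedom, as reported in the text.
p = chi2_sf_even_df(5.8, 4)
print(round(p, 3))
```

For 4 degrees of freedom the sum has two terms, so the value is exp(-2.9) * (1 + 2.9), in line with the reported 0.21.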

Consider the model of all 15 four-element constraints like [A, B, C, D], that is, 
all 4-way margins. We could not obtain the Markov basis for this model, but SIS 
still works well with cv² = 5.0 when the target distribution is uniform. LP gave
100% good tables, whereas the shuttle algorithm gave only 2% good tables after 
10 iterations. 


EXAMPLE 7.4. Consider the 3 x 3 x 3 example (Table 4) from [14], page 379,
with a model of no-3-way interaction. The conditional goodness-of-fit test for this
model requires fixing all "line sums."

When ordered left to right across rows, Proposition 3.1 implies sequential
intervals and Corollary 5.1 gives an IP/LP gap of 0 at every step. In simulation LP
gave 100% good tables, and the shuttle algorithm also gave 100% good tables after one
iteration.

When the underlying distribution is uniform, the uniform sampling method
gave cv² of 2.08 and estimated the total number of tables to be 1.9 x 10^12. This
is consistent with the number 1,919,899,782,953 from LattE, computed in a total
elapsed real time of 45 seconds on a 2.8 GHz desktop computer. When targeting
the hypergeometric distribution, the hypergeometric sampling method gave
cv² = 180.7.


EXAMPLE 7.5. Consider data in Table 5 of genotype pairs from [20]. The 
constraints for conditional goodness-of-fit test of Hardy-Weinberg proportions are 
the nine allele counts, which are nine linear functions that count twice the diagonal 


TABLE 4 
3 x 3 x 3 table from [14] 


9 16 41 8 8 46 Il M 38 
85 52 105 35 29 54 47 35 115 
7] 30 38 AJ. d5. 22 2 2l 42 




TABLE 5
Rhesus data from [20]

[Table body not recoverable; the column labels are the nine alleles B1, ..., B9.]

entry, so the A matrix has entries 0, 1 and 2. For sequential sampling, the order of 
cells given by Table 6 leads to sequential intervals by Proposition 3.1. In general, 
for the genotype problem sampling across rows will not give intervals. 

The lead monomials in a lex basis have exponents that are all 0 or 1, so LP
gives the exact lower bounds by Proposition 5.1. For the upper bound, the grevlex 
condition of Proposition 5.2 does not hold from the first cell, but it does hold 
after a few cells, so IP and LP give the same bounds after some initial cells. The 
simulation with LP produced 100% good tables. See [23] for a direct sampling 
strategy and some further discussion of this example. 


EXAMPLE 7.6. Consider a constraint matrix A of the form $A = (A_0 | I)$ with
0 or 1 entries. Here I is the e x e identity matrix and $A_0$ is of size e x f with
columns $a_1, \ldots, a_f$. This occurs in a tomography problem introduced by
Vardi [31], where A is a routing matrix for which routes between adjacent vertices
use the connecting edge, and the edge counts are put last as slack variables. The
integer data are y = Ax, where x are traffic counts between ordered pairs of nodes on
a graph and y is the
aggregate traffic across links. The sampling method of Tebaldi and West [30] for 
Bayesian computation of the posterior distribution is closely related to sequential 
sampling. Dinwoodie [15] shows how fast sampling can be used in a Monte Carlo 
EM algorithm for estimating traffic rates. 


TABLE 6 
Order of cells 




The property of sequential intervals holds for the entries of x under the
constraint Ax = y in the order of the columns. With indeterminates
$w_1, \ldots, w_f$ for the first f columns and $z_1, \ldots, z_e$ for the last e
slack variables, a lex Gröbner basis in
$\mathbb{Q}[w_1, \ldots, w_f, z_1, \ldots, z_e]$ consists of the f binomials
$w_i - z^{a_i}$.

Linear programming will give the exact interval bounds at each step because
of the square-free lead monomials. The shuttle algorithm will also give the exact
intervals in one step. The interval for the first cell is exactly
$[0, \min\{y_j : (a_1)_j \neq 0,\ 1 \le j \le e\}]$ and the same type of problem
recurs at each step 1,..., f.
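The interval computation for this example can be sketched in a few lines of Python. The sketch assumes that A_0 has 0/1 entries and that every column contains at least one 1 (every route uses at least one edge). Drawing each entry uniformly from its interval is an illustrative choice and is not claimed to be the proposal distribution used by the authors; the function names are hypothetical.

```python
import random

def cell_upper_bound(A0, residual, col):
    """Upper end of the interval for x_col: the smallest remaining y_j over the
    rows j in which column col of A0 is nonzero."""
    rows = [j for j in range(len(A0)) if A0[j][col] != 0]
    return min(residual[j] for j in rows)

def sample_solution(A0, y, rng):
    """Fill x_1, ..., x_f in column order; the slack variables absorb the rest."""
    residual = list(y)
    x = []
    for col in range(len(A0[0])):
        value = rng.randint(0, cell_upper_bound(A0, residual, col))
        x.append(value)
        for j in range(len(A0)):
            residual[j] -= A0[j][col] * value
    return x + residual

# Toy routing matrix with e = 2 edges and f = 3 routes (made-up data).
A0 = [[1, 0, 1],
      [1, 1, 0]]
y = [3, 5]
first_interval = (0, cell_upper_bound(A0, y, 0))
sol = sample_solution(A0, y, random.Random(0))
print(first_interval, sol)
```

Because each draw is capped by the smallest remaining edge count, the residuals never become negative, so every sampled vector satisfies Ax = y with nonnegative integer entries.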

It is possible to establish properties of SIS for some classes of examples, or, 
in other words, for some types of contingency tables with certain constraints and 
margin values. This is an area of ongoing work, but at this time we can make 
some statements. Logistic regression tables with one integer covariate and positive 
column sums (at least one measurement at each level of the covariate) have the 
sequential interval property. This is proved in [7]. The subbasis that corresponds to 
differences of adjacent minors satisfies the conditions of Proposition 4.1. However, 
the IP/LP gap may not be zero. 

Also, two-way tables with structural zeros and fixed row and column sums have 
sequential intervals. The same algebraic technology also shows that case/control 
data with two factors, such as Example 7.1, has the sequential interval property. 
We conjecture that decomposable graphical models will have the sequential
interval property under some order on the cells, but at this time a careful proof is
not complete.


8. Conclusion. We have described an efficient sequential importance sam- 
pling method for sampling multiway tables with given constraints. It can be used 
to approximate exact conditional inference on contingency tables. SIS sequentially 
builds up the proposal distribution by sampling table entries one by one. We have 
presented a theory that relates algebraic properties of collections of Markov moves 
to certain geometric properties of contingency tables. The geometric properties of 
"sequential intervals" and the relationship of IP to LP are important for the perfor- 
mance of sequential sampling. Many real examples show that the theory is applica- 
ble and useful, and can be used in some examples when a Markov basis cannot be 
found. 

In practice, one may try sequential sampling even if the sequential interval prop- 
erty does not hold or if the algebraic conditions are not satisfied or not checked. 
If one can find rough bounds for each entry and design the proposal distribution 
carefully, so that the fraction of valid tables is high and the cv² is low, SIS may
still give satisfactory results. 

Further work is required to formulate a method to design the proposal distribu- 
tion at each step. We have seen that the uniform sampling method works very well 
when the underlying distribution is uniform. However, when the target distribution 
is hypergeometric, the hypergeometric sampling method could be improved. 




Acknowledgments. We have used the software 4ti2, lpSolve, R and Singular
for computations. We thank Sam Buttrey for the lpSolve package in R and
Raymond Hemmecke for a version of 4ti2. A referee has suggested that Propo- 


; -1 
sition 4.1 holds with the weaker condition (Iy : [leg xp ^ vs: 26^ 001) = p, and 
we think that further theoretical work on connectivity of Markov chains with the
goal of computational efficiency would be valuable. 


REFERENCES 


[1] BERKELAAR, M., EIKLAND, K. and NOTEBAERT, P. (2004). lpSolve: Open Source
(Mixed-Integer) Linear Programming System. GNU LGPL (Lesser General Public License).

[2] BESAG, J. and CLIFFORD, P. (1989). Generalized Monte Carlo significance tests.
Biometrika 76 633-642. MR1041408

[3] BRESLOW, N. E. and DAY, N. E. (1980). Statistical Methods in Cancer Research 1.
The Analysis of Case-Control Studies. International Agency for Research on Cancer,
Lyon, France.

[4] BUNEA, F. and BESAG, J. (2000). MCMC in I x J x K contingency tables. In Monte
Carlo Methods (N. Madras, ed.) 25-36. Amer. Math. Soc., Providence, RI. MR1772304

[5] BUZZIGOLI, L. and GIUSTI, A. (1999). An algorithm to calculate the lower and
upper bounds of the elements of an array given its marginals. In Proc. Conference on
Statistical Data Protection 131-147. Eurostat, Luxembourg.

[6] CHEN, Y., DIACONIS, P., HOLMES, S. and LIU, J. S. (2005). Sequential Monte Carlo
methods for statistical analysis of tables. J. Amer. Statist. Assoc. 100 109-120.
MR2156822

[7] CHEN, Y., DINWOODIE, I., DOBRA, A. and HUBER, M. (2005). Lattice points,
contingency tables and sampling. In Integer Points in Polyhedra - Geometry, Number
Theory, Algebra, Optimization (A. Barvinok, M. Beck, C. Haase, B. Reznick and
V. Welker, eds.) 65-78. Amer. Math. Soc., Providence, RI. MR2134761

[8] CHRISTENSEN, R. (1990). Log-Linear Models. Springer, New York. MR1075412

[9] CONTI, P. and TRAVERSO, C. (1991). Buchberger algorithm and integer programming.
Applied Algebra, Algebraic Algorithms and Error-Correcting Codes. Lecture Notes in
Comput. Sci. 539 130-139. Springer, Berlin. MR1229314

[10] COX, D., LITTLE, J. and O'SHEA, D. (1997). Ideals, Varieties, and Algorithms,
2nd ed. Springer, New York. MR1417938

[11] DE LOERA, J. A., HAWS, D., HEMMECKE, R., HUGGINS, P., TAUZER, J. and
YOSHIDA, R. (2003). A User's Guide for LattE v1.1. Available at
www.math.ucdavis.edu/~latte/.

[12] DE LOERA, J. A. and ONN, S. (2006). Markov bases of three-way tables are
arbitrarily complicated. J. Symbolic Comput. 41 173-181.

[13] DIACONIS, P. and EFRON, B. (1985). Testing for independence in a two-way table:
New interpretations of the chi-square statistic (with discussion). Ann. Statist. 13
845-913. MR0803747

[14] DIACONIS, P. and STURMFELS, B. (1998). Algebraic methods for sampling from
conditional distributions. Ann. Statist. 26 363-397. MR1608156

[15] DINWOODIE, I. H. (2000). Conditional expectations in network traffic estimation.
Statist. Probab. Lett. 47 99-103.

[16] DOBRA, A. and FIENBERG, S. (2001). Bounds for cell entries in contingency tables
induced by fixed marginal totals with applications to disclosure limitation.
Statistical J. United Nations Economic Commission for Europe 18 363-371.

[17] DOBRA, A., TEBALDI, C. and WEST, M. (2006). Data augmentation in multi-way
contingency tables with fixed marginal totals. J. Statist. Plann. Inference 136
355-372.


[18] EDWARDS, D. and HAVRÁNEK, T. (1985). A fast procedure for model search in
multidimensional contingency tables. Biometrika 72 339-351. MR0801773

[19] GREUEL, G.-M., PFISTER, G. and SCHOENEMANN, H. (2003). Singular: A computer
algebra system for polynomial computations. Available at www.singular.uni-kl.de.

[20] GUO, S. W. and THOMPSON, E. A. (1992). Performing the exact test of
Hardy-Weinberg proportion for multiple alleles. Biometrics 48 361-372.

[21] HEMMECKE, R. and HEMMECKE, R. (2003). 4ti2 Version 1.1: Computation of Hilbert
bases, Graver bases, toric Gröbner bases, and more. Available at www.4ti2.de.

[22] HOSTEN, S. and STURMFELS, B. (2006). Computing the integer programming gap.
Combinatorica. To appear.

[23] HUBER, M., CHEN, Y., DINWOODIE, I., DOBRA, A. and NICHOLAS, M. (2006). Monte
Carlo algorithms for Hardy-Weinberg proportions. Biometrics 62 49-53.

[24] KONG, A., LIU, J. S. and WONG, W. H. (1994). Sequential imputations and Bayesian
missing data problems. J. Amer. Statist. Assoc. 89 278-288.

[25] KREUZER, M. and ROBBIANO, L. (2000). Computational Commutative Algebra 1.
Springer, Berlin. MR1790326

[26] R DEVELOPMENT CORE TEAM (2004). R: A language and environment for statistical
computing. Available at www.r-project.org.

[27] RAPALLO, F. (2006). Markov bases and structural zeros. J. Symbolic Comput. 41
164-172.

[28] STURMFELS, B. (1996). Gröbner Bases and Convex Polytopes. Amer. Math. Soc.,
Providence, RI. MR1363949

[29] STURMFELS, B. (2002). Solving Systems of Polynomial Equations. Amer. Math. Soc.,
Providence, RI. MR1925796

[30] TEBALDI, C. and WEST, M. (1998). Bayesian inference on network traffic using
link count data (with discussion). J. Amer. Statist. Assoc. 93 557-576. MR1631325

[31] VARDI, Y. (1996). Network tomography: Estimating source-destination traffic
intensities from link data. J. Amer. Statist. Assoc. 91 365-377. MR1394093


Y. CHEN
DEPARTMENT OF STATISTICS
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
CHAMPAIGN, ILLINOIS 61820
USA
E-MAIL: yuguo@uiuc.edu

I. H. DINWOODIE
INSTITUTE OF STATISTICS AND DECISION SCIENCES
DUKE UNIVERSITY
DURHAM, NORTH CAROLINA 27708-0251
USA
E-MAIL: ihd@stat.duke.edu

S. SULLIVANT
DEPARTMENT OF MATHEMATICS
HARVARD UNIVERSITY
ONE OXFORD STREET
CAMBRIDGE, MASSACHUSETTS 02138
USA
E-MAIL: seths@math.harvard.edu


The Annals of Statistics
2006, Vol. 34, No. 1, 546-558
DOI: 10.1214/009053605000000813
© Institute of Mathematical Statistics, 2006


DOUBLING AND PROJECTION: A METHOD OF CONSTRUCTING
TWO-LEVEL DESIGNS OF RESOLUTION IV¹


BY HEGANG H. CHEN AND CHING-SHUI CHENG


University of Maryland School of Medicine, and Academia Sinica
and University of California, Berkeley


Given a two-level regular fractional factorial design of resolution IV, the
method of doubling produces another design of resolution IV which doubles
both the run size and the number of factors of the initial design. On the other
hand, the projection of a design of resolution IV onto a subset of factors is of
resolution IV or higher. Recent work in the literature of projective geometry
essentially determines the structures of all regular designs of resolution IV
with n ≥ N/4 + 1 in terms of doubling and projection, where N is the run
size and n is the number of factors. These results imply that, for instance, all
regular designs of resolution IV with 5N/16 < n < N/2 must be projections
of the regular design of resolution IV with N/2 factors. We show that, for
9N/32 ≤ n ≤ 5N/16, all minimum aberration designs are projections of the
design with 5N/16 factors which is constructed by repeatedly doubling the
$2^{5-1}$ design defined by I = ABCDE. To prove this result, we also derive
some properties of doubling, including an identity that relates the wordlength
pattern of a design to that of its double and a result that does the same for the
alias patterns of two-factor interactions.


1. Introduction. Doubling is a simple but powerful method of constructing 
two-level fractional factorial designs, in particular, those of resolution IV. Suppose 
X is an N x n matrix with two distinct entries, 1 and -1. Then the double of X,
denoted by D(X), is the 2N x 2n matrix

$$D(X) = \begin{bmatrix} X & X \\ X & -X \end{bmatrix},$$

that is,

$$D(X) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \otimes X,$$

where $\otimes$ is the Kronecker product. Suppose X defines an N-run design for n
two-level factors, where the two levels are denoted by 1 and -1, each column of X
corresponds to a factor and each row of X defines a factor-level combination.


Received June 2004; revised March 2005.
¹Supported in part by NSF Grant DMS-00-71438.
AMS 2000 subject classification. 62K15.
Key words and phrases. Maximal design, minimum aberration, orthogonal array,
wordlength pattern.




DOUBLING AND PROJECTION 547 


Then D(X) defines a design which doubles both the run size and the number of 
factors of X. Such a method was used by Plackett and Burman [10] in their classi- 
cal paper on orthogonal main-effect plans. 
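The doubling operation is easy to state in code. The following is a minimal sketch in pure Python, with matrices stored as lists of rows; the function names `double` and `kron_h2` are illustrative. It builds D(X) blockwise and checks that the result agrees with the Kronecker-product form of the definition.

```python
def double(X):
    """Return D(X) = [[X, X], [X, -X]] for a +1/-1 matrix given as a list of rows."""
    top = [row + row for row in X]
    bottom = [row + [-v for v in row] for row in X]
    return top + bottom

def kron_h2(X):
    """Kronecker product of [[1, 1], [1, -1]] with X."""
    H = [[1, 1], [1, -1]]
    return [[h * v for h in h_row for v in row] for h_row in H for row in X]

# A 4-run design for 2 factors (the 2^2 complete factorial).
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
DX = double(X)
print(len(DX), len(DX[0]))
```

Both the run size and the number of factors are doubled: a 4 x 2 input gives an 8 x 4 output.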

An N x m submatrix of X, where m ≤ n, is called a projection of X (onto m
factors). Equivalently, such a design can be obtained by deleting n - m columns
(factors) from X. We allow the possibility m = n, so that a design is considered
as its own projection onto all columns. If X is of resolution IV, then all
projections of X are of resolution IV or higher. A less obvious fact is that if X is
of resolution IV, then D(X) is also of resolution IV. Thus, starting with a design of
resolution IV, the combined operation of doubling followed by projection yields a
design of resolution IV or higher.

Some recent results in the literature of finite projective geometry [2, 3, 8]
essentially characterize, in terms of doubling and projection, the structures of
regular designs of resolution IV with n ≥ N/4 + 1. These elegant and difficult results
have important implications in statistical design, but are not easily accessible to
statisticians since they were stated in the language of projective geometry. One
purpose of this paper is to review some of these results and rephrase them in
design language. This should be useful to people interested in the statistical design
of experiments. In the past few years we have encountered several occasions where
some special cases of these results were either re-discovered by statisticians or
would have been very helpful if they had been more widely known in the
statistical literature.

We also present some new results, including an identity that relates the 
wordlength pattern of a design to that of its double and a result that does the 
same for the alias patterns of two-factor interactions. These results and some other 
basic properties of doubling are presented in Section 2. Characterization of regular 
designs of resolution IV with n ≥ N/4 + 1 in terms of doubling and projection is
discussed in Section 3. Section 4 shows how the results in Sections 2 and 3 can
be used to investigate certain minimum aberration designs. In particular, we show
that, for 9N/32 ≤ n ≤ 5N/16, a minimum aberration design can be obtained by
deleting factors from the minimum aberration design with 5N/16 factors which
can be constructed by repeatedly doubling the $2^{5-1}$ design defined by I = ABCDE.

We conclude this section with a review of some basic terminology and notation.
A factorial design is called an orthogonal array of strength t if, in all projections
onto t factors, all factor-level combinations appear the same number of times.
Regular fractional factorial designs are those which can be constructed by using
defining relations and are discussed in many textbooks on experimental design;
see, for example, [11]. Each interaction that appears in the defining relation is
called a defining word, and the resolution of a regular design is defined as the
length of the shortest defining word. It is well known that a regular design of
resolution R is an orthogonal array of strength R - 1. For each positive integer i,
let $B_i$ be the number of defining words of length i. Then the resolution is equal
to the smallest i such that $B_i > 0$. The sequence $(B_1, B_2, \ldots, B_n)$ is
called the


548 H. H CHEN AND C -S CHENG 


wordlength pattern of the design. The minimum aberration criterion introduced by
Fries and Hunter [9] chooses a design by sequentially minimizing $B_1, B_2, B_3, \ldots$
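The wordlength pattern and the resolution can be computed directly from a set of generator words. The sketch below is a minimal Python illustration that represents each word as a set of factor letters and forms products of words as symmetric differences. It assumes the generators are independent, so that no product is the identity; the function names and the 6-factor example are illustrative and are not taken from the paper.

```python
from itertools import combinations

def defining_words(generators):
    """All products of nonempty subsets of the generator words."""
    words = set()
    for r in range(1, len(generators) + 1):
        for subset in combinations(generators, r):
            word = frozenset()
            for g in subset:
                word = word ^ frozenset(g)  # product of words = symmetric difference
            words.add(word)
    return words

def wordlength_pattern(generators, n):
    """Return [B_1, ..., B_n] for a regular two-level design with n factors."""
    B = [0] * n
    for word in defining_words(generators):
        B[len(word) - 1] += 1
    return B

def resolution(B):
    """Smallest i with B_i > 0."""
    return next(i + 1 for i, b in enumerate(B) if b > 0)

# The 2^(5-1) design with I = ABCDE, and a 2^(6-2) design with I = ABCE = BCDF.
pattern_5 = wordlength_pattern(["ABCDE"], 5)
pattern_6 = wordlength_pattern(["ABCE", "BCDF"], 6)
print(pattern_5, resolution(pattern_5))
print(pattern_6, resolution(pattern_6))
```

Comparing two designs under the minimum aberration criterion then amounts to comparing their patterns lexicographically.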

Throughout this paper N and n denote the run size and the number of factors,
respectively. We shall restrict consideration to two-level designs only. Under a
regular two-level design, N must be a power of 2, say, $N = 2^{n-p}$. This design is a
$2^{-p}$-fraction of a complete $2^n$ factorial, and is usually referred to as a
$2^{n-p}$ design. It is well known that a regular two-level design of resolution III
must have n ≤ N - 1, and a regular two-level design of resolution IV must have
n ≤ N/2. A design of resolution III is called saturated if n = N - 1, and a design of
resolution IV is called saturated if n = N/2.


2. Some basic properties of doubling. We first note that if X is a Hadamard 
matrix of order N, then D(X) is a Hadamard matrix of order 2N. In fact, this 
was what Plackett and Burman used in their 1946 classic paper. The following 
properties can easily be established: 


THEOREM 2.1. If X is an orthogonal array of strength two, then D(X) is
also an orthogonal array of strength two. Likewise, if X is an orthogonal array of
strength three, then D(X) is an orthogonal array of strength three.


Theorem 2.1 has no counterpart for designs of higher strength. In fact, for two
columns a and b of X, D(X) must have four columns of the form
$\begin{bmatrix} a & a & b & b \\ a & -a & b & -b \end{bmatrix}$,
whose componentwise product has all the entries equal to 1. Therefore, D(X)
cannot have strength higher than three. For regular designs, these four columns would
lead to a defining word of length four.

Another important elementary fact is that saturated regular designs of
resolution III are unique (up to isomorphism). Such a design of size $2^k$ can be
obtained by deleting the first column of

$$\underbrace{\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \otimes \cdots \otimes \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}}_{k}.$$

In other words, saturated regular designs of resolution III can be obtained by
deleting a column of 1's after successively doubling the 2 x 2 matrix
$\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$. Since
all regular designs of resolution III or higher can be constructed by deleting a
subset of columns from a saturated regular design of resolution III (or, equivalently,
by projecting the saturated regular design of resolution III onto the complementary
set of columns), all two-level regular designs of resolution III or higher can be
constructed by the operation of doubling followed by deletion (or projection). These
observations also imply that if X is a regular design, then D(X) is regular. In other
words, in Theorem 2.1, if X is a regular design of resolution III (resp. IV), then



D(X) is also a regular design of resolution III (resp. IV). However, by the
comment following Theorem 2.1, if X is a regular design of resolution higher than IV,
then D(X) is a regular design of resolution IV only.

Saturated regular designs of resolution IV are also unique (up to isomorphism).
One way of constructing such designs is to fold over saturated regular designs of
resolution III. We would like to present another neat and compact method of
construction using doubling. Consider the $2^2$ complete factorial

$$\begin{bmatrix} 1 & 1 \\ 1 & -1 \\ -1 & 1 \\ -1 & -1 \end{bmatrix},$$

whose number of factors is precisely half of the run size. Successively doubling the
$2^2$ complete factorial results in designs whose number of factors is half of the run
size, and by the discussion at the end of the previous paragraph, are regular designs
of resolution IV. This implies that saturated regular designs of resolution IV can
be obtained by successively doubling the $2^2$ complete factorial.
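This construction can be checked by brute force for a small run size. The sketch below, in pure Python with illustrative function names, doubles the 2^2 complete factorial twice and searches for the shortest set of columns whose componentwise product is constant. For a regular design that length is the resolution.

```python
from itertools import combinations

def double(X):
    """Return D(X) = [[X, X], [X, -X]] for a matrix given as a list of rows."""
    top = [row + row for row in X]
    bottom = [row + [-v for v in row] for row in X]
    return top + bottom

def shortest_word_length(X):
    """Length of the shortest set of columns whose product is constant over runs."""
    n = len(X[0])
    for k in range(1, n + 1):
        for cols in combinations(range(n), k):
            products = set()
            for row in X:
                p = 1
                for c in cols:
                    p *= row[c]
                products.add(p)
            if len(products) == 1:
                return k
    return None

X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]  # the 2^2 complete factorial
for _ in range(2):
    X = double(X)
runs, factors = len(X), len(X[0])
print(runs, factors, shortest_word_length(X))
```

The exhaustive search is only practical for small designs; it is meant as a check of the statement, not as a construction method.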

As mentioned earlier, the designs obtained by doubling those of resolution III
or higher are of resolution III or IV. To study statistical properties of such designs,
it is important to consider the alias pattern of two-factor interactions. Suppose
$N = 2^{n-p}$. Among the $2^n - 1$ factorial effects, $2^p - 1$ appear in the defining
relation. The rest are divided into $g = 2^{n-p} - 1$ alias sets, each of size $2^p$.
Without loss of generality, assume that the first $f = 2^{n-p} - 1 - n$ of these alias
sets do not contain main effects. For each $1 \le i \le g$, let $m_i$ be the number of
two-factor interactions in the ith alias set. Cheng, Steinberg and Sun [7] showed that

(2.1)  $B_3 = \frac{1}{3}\left\{\binom{n}{2} - \sum_{i=1}^{f} m_i\right\}$,

(2.2)  $B_4 = \frac{1}{6}\left\{\sum_{i=1}^{g} m_i^2 - \binom{n}{2}\right\}$.

Furthermore, estimation capacity, a measure of model robustness discussed in [7],
can be expressed in terms of the $m_i$'s. The following result relates the $m_i$ values
of a design to those of its double.


THEOREM 2.2. Suppose a regular design X has g alias sets, the first f of
which do not contain main effects, and for $1 \le i \le g$, $m_i$ is the number of
two-factor interactions in the ith alias set. Let the corresponding numbers of D(X)
be $g^*$, $f^*$ and $m_i^*$, respectively. Then $g^* = 2g + 1$, $f^* = 2f + 1$,
$m_1^* = m_2^* = 2m_1$, $m_3^* = m_4^* = 2m_2$, ..., $m_{2f-1}^* = m_{2f}^* = 2m_f$,
$m_{2f+1}^* = n$, and $m_{2f+2}^* = m_{2f+3}^* = 2m_{f+1}$, ...,
$m_{2g}^* = m_{2g+1}^* = 2m_g$.




PROOF. Suppose X is a $2^{n-p}$ design. Then D(X) is a $2^{2n-(n+p-1)}$ design,
and $g = 2^{n-p} - 1$, $f = 2^{n-p} - 1 - n$, $g^* = 2^{2n-(n+p-1)} - 1$,
$f^* = 2^{2n-(n+p-1)} - 2n - 1$. The relations $f^* = 2f + 1$ and $g^* = 2g + 1$ follow.

To each factor in X, say A, there correspond two factors in D(X). Suppose the
column of X corresponding to A is a. We shall denote the factor in D(X)
corresponding to the column $\begin{bmatrix} a \\ a \end{bmatrix}$ by $A^+$ and the
factor corresponding to $\begin{bmatrix} a \\ -a \end{bmatrix}$ by $A^-$.
Then it is easy to see that, for any two factors A and B in X, under D(X), $A^+B^+$
and $A^-B^-$ are aliased, $A^+B^-$ and $A^-B^+$ are aliased, and $A^+A^-$ and $B^+B^-$
are aliased. Consequently, if the two-factor interactions AB and CD are aliased
under X, then $A^+B^+$, $A^-B^-$, $C^+D^+$ and $C^-D^-$ are aliased under D(X), and
$A^+B^-$, $A^-B^+$, $C^+D^-$ and $C^-D^+$ are also aliased. If the main effect A and
two-factor interaction CD are aliased under X, then $A^+$, $C^+D^+$ and $C^-D^-$ are
aliased under D(X), and $A^-$, $C^+D^-$ and $C^-D^+$ are also aliased. Using these
facts, it can be seen that each alias set of two-factor interactions under X
determines two alias sets of two-factor interactions under D(X), each of which is
twice as large as the original alias set under X. Furthermore, all the n two-factor
interactions of the form $A^+A^-$ constitute another alias set which does not contain
main effects. □


We have the following relationship between the wordlength pattern of X and 
that of D(X). 


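THEOREM 2.3. Let $B_k$ and $B_k^*$ be the number of defining words of length k of
X and D(X), respectively. Then

$$B_k^* = \begin{cases}
\displaystyle\sum_{s=0}^{\min[(k-1)/2,\,n]} B_{k-2s}\binom{n-k+2s}{s}2^{k-2s-1}, & \text{if } k \text{ is odd;}\\[2ex]
\displaystyle\sum_{s=0}^{\min[k/2-1,\,n]} B_{k-2s}\binom{n-k+2s}{s}2^{k-2s-1} + \binom{n}{k/2}, & \text{if } k \text{ is even and } k/2 \text{ is even;}\\[2ex]
\displaystyle\sum_{s=0}^{\min[k/2-1,\,n]} B_{k-2s}\binom{n-k+2s}{s}2^{k-2s-1}, & \text{if } k \text{ is even and } k/2 \text{ is odd.}
\end{cases}$$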


PROOF. If $A_{i_1} \cdots A_{i_t}$ is a defining word of X, then
$A_{i_1}^{j_1} \cdots A_{i_t}^{j_t}$, where each $j_l$ is a + or -, is a defining word
of D(X) as long as an even number of the $j_l$'s are -'s. When s is even, for any s
distinct factors $A_{i_1}, \ldots, A_{i_s}$,
$A_{i_1}^+A_{i_1}^- \cdots A_{i_s}^+A_{i_s}^-$ is also a defining word of D(X). In
general, a defining word of D(X) is of the form
$A_{i_1}^+A_{i_1}^- \cdots A_{i_s}^+A_{i_s}^- A_{i_{s+1}}^{j_{s+1}} \cdots A_{i_{s+t}}^{j_{s+t}}$,
where $A_{i_1}, \ldots, A_{i_{s+t}}$ are distinct factors of X,
$A_{i_{s+1}} \cdots A_{i_{s+t}}$ is a defining word of X, and an even (resp. odd)
number of the $j_l$'s are -'s when s is even (resp. odd). If
$A_{i_1}^+A_{i_1}^- \cdots A_{i_s}^+A_{i_s}^- A_{i_{s+1}}^{j_{s+1}} \cdots A_{i_{s+t}}^{j_{s+t}}$
is of length k, then t = k - 2s. This implies that $0 \le s \le \min[k/2, n]$. On the
other hand, when s is odd, we must have $t \ge 1$. Therefore, we have
$s \le \min[(k-1)/2, n]$ when k is odd, $s \le \min[k/2, n]$ when k is even and k/2 is
even, and $s \le \min[k/2 - 1, n]$ when k is even and k/2 is odd. The theorem then
follows from the fact that, for given s, there are $2^{t-1} = 2^{k-2s-1}$ ways to
choose $(j_{s+1}, \ldots, j_{s+t})$ if $t \ge 1$, $B_{k-2s}$ ways to choose
$A_{i_{s+1}}, \ldots, A_{i_{s+t}}$, and $\binom{n-k+2s}{s}$ ways to choose
$A_{i_1}, \ldots, A_{i_s}$. Note that the last term $\binom{n}{k/2}$ in the case where
k/2 is even corresponds to s = k/2. In this case, there are $\binom{n}{k/2}$ ways to
choose $A_{i_1}^+A_{i_1}^- \cdots A_{i_s}^+A_{i_s}^-$. □


In particular, for designs of resolution III or higher we have $\bar{B}_3 = 4B_3$ and
$\bar{B}_4 = 8B_4 + \binom{n}{2}$. These identities can also be derived by using (2.1), (2.2) and
Theorem 2.2.
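
The counting argument can be checked by brute force on a small design. The sketch below rests on several assumptions that are not in the paper: factors are coded as nonzero binary vectors (integers), so that a defining word is a set of factors whose XOR is zero, and doubling maps a code `v` to `v` and `v` with one new bit set. The formula is taken to have summand $B_{k-2s}\binom{n-k+2s}{s}2^{k-2s-1}$, plus $\binom{n}{k/2}$ when $k/2$ is even, as read off from the counts in the proof. The $2^{5-2}$ design with $D = AB$, $E = AC$ is a hypothetical example.

```python
from functools import reduce
from itertools import combinations
from math import comb
from operator import xor

def wordlength_pattern(factors):
    """B[k] = number of k-subsets of factor codes whose XOR is zero (words of length k)."""
    n = len(factors)
    B = [0] * (n + 1)
    for k in range(1, n + 1):
        B[k] = sum(1 for w in combinations(factors, k) if reduce(xor, w) == 0)
    return B

def double(factors, m):
    """Each factor code v yields v itself (v+) and v with the new bit m set (v-)."""
    return factors + [v | (1 << m) for v in factors]

def predicted(B, n, k):
    """Number of length-k words of D(X) according to the Theorem 2.3 formula."""
    total = 0
    for s in range((k - 1) // 2 + 1):
        t = k - 2 * s                      # length of the underlying word of X
        if t <= n:
            total += B[t] * comb(n - t, s) * 2 ** (t - 1)
    if k % 4 == 0:                         # k even and k/2 even: the s = k/2 term
        total += comb(n, k // 2)
    return total

# Hypothetical example: 2^{5-2} design with D = AB and E = AC (A=1, B=2, C=4).
X = [1, 2, 4, 3, 5]
B = wordlength_pattern(X)
B_doubled = wordlength_pattern(double(X, m=3))
print(B[3], B[4], B_doubled[3], B_doubled[4])
```

For this example the brute-force counts for the doubled design agree with the formula at every length, including the special cases for lengths three and four.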

The following is an immediate consequence of Theorem 2.3. 


COROLLARY 2.4. Given two regular designs $X_1$ and $X_2$, $X_1$ has less aberration
than $X_2$ if and only if $D(X_1)$ has less aberration than $D(X_2)$.


3. Maximal designs of resolution IV. One major difference between designs 

of resolution III and IV is that all regular designs of resolution III can be con- 
structed by deleting factors from saturated regular designs of resolution III. On the 
other hand, not all regular designs of resolution IV can be obtained by deleting fac- 
tors from saturated regular designs of resolution IV: while all designs obtained by 
deleting factors from saturated regular designs of resolution IV are the so-called 
even designs which have no defining words of odd lengths, there are designs of 
resolution IV which are not even designs. 
We say that a regular design of resolution IV or higher is maximal if its resolution
reduces to three whenever an extra factor is added. Clearly, the saturated
regular designs of resolution IV are maximal. If a design is not maximal, then at 
least one factor can be added so that the design is still of resolution IV or higher. 
One can keep adding factors until it becomes maximal. Therefore, if a regular de- 
sign of resolution IV is not maximal, then it can be obtained by deleting factors 
from a maximal design. Because of the importance of this fact, we state it formally 
as a proposition: 


PROPOSITION 3.1. Every regular design of resolution IV is a projection of a 
certain maximal regular design of resolution IV or higher. 


We can also define maximal regular designs of resolution III. It turns out that,
for a given run size, there is only one maximal regular design of resolution III: the
saturated regular design of resolution III. As mentioned earlier, saturated regular
designs of resolution IV are maximal, but they are not the only maximal designs
of resolution IV.

Theoretically speaking, for a fixed number of runs, if we can determine all 
the maximal designs of resolution IV or higher, then all regular designs of res- 
olution IV can be constructed by projecting the maximal designs onto subsets of 
factors. Recently some significant progress in the determination of such maximal 
designs has been made in the literature of finite projective geometry. Note that 
maximal designs of resolution IV or higher are equivalent to maximal caps in a 
finite projective geometry. 

We first state a simple but interesting characterization of maximal regular de- 
signs of resolution IV. 


THEOREM 3.2. A regular design of resolution IV is maximal if and only if
$m_i \neq 0$ for all $i = 1, \ldots, f$.


This result was first stated in the coding-theoretic language; see, for exam- 
ple, [2]. Chen and Cheng [5] rephrased it in the above form. They also defined 
the notion of estimation index. Another way to state the result in Theorem 3.2 is
that a regular design of resolution IV is maximal if and only if its estimation in- 
dex is equal to 2. Since we can estimate one effect from each alias set assuming 
that the other effects in the same alias set are negligible, Theorem 3.2 says that a 
regular design of resolution IV is maximal if and only if all the available degrees 
of freedom can be used to estimate main effects and two-factor interactions. Such 
designs are said to be second-order saturated by Block and Mee [1]. 

The following result, whose geometric version can also be found in [2], reveals 
the crucial role played by the method of doubling in constructing designs of reso- 
lution IV. 


THEOREM 3.3. Let X be a regular design of resolution IV or higher. Then X 
is maximal if and only if D(X) is maximal. 


One can see that Theorem 3.3 follows immediately from Theorem 2.2 and Theorem 3.2.
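
The criterion of Theorem 3.2 and the statement of Theorem 3.3 can be illustrated on the 16-run designs. This is a minimal sketch under assumptions not made in the paper: factors are coded as nonzero binary vectors (integers), each alias set is identified with the XOR of any effect in it, and doubling adds one new bit. The function names and the non-maximal $2^{6-2}$ example are hypothetical.

```python
from itertools import combinations

def two_fi_counts(factors, m):
    """Return {alias set: m_i} over the alias sets that contain no main effect."""
    counts = {v: 0 for v in range(1, 1 << m) if v not in factors}
    for u, v in combinations(factors, 2):
        if (u ^ v) not in counts:
            raise ValueError("a two-factor interaction is aliased with a main effect")
        counts[u ^ v] += 1
    return counts

def is_maximal(factors, m):
    """Theorem 3.2 criterion for a resolution IV design: every m_i is nonzero."""
    return all(c > 0 for c in two_fi_counts(factors, m).values())

def double(factors, m):
    """Each factor code v yields v itself and v with the new bit m set."""
    return factors + [v | (1 << m) for v in factors]

A, B, C, D = 1, 2, 4, 8
designs = {
    "2^(5-1), I = ABCDE": [A, B, C, D, A ^ B ^ C ^ D],
    "2^(8-4), saturated": [A, B, C, D, A ^ B ^ C, A ^ B ^ D, A ^ C ^ D, B ^ C ^ D],
    "2^(6-2), E = ABC, F = ABD": [A, B, C, D, A ^ B ^ C, A ^ B ^ D],
}
# For each design: (maximal before doubling, maximal after doubling).
results = {name: (is_maximal(f, 4), is_maximal(double(f, 4), 5))
           for name, f in designs.items()}
for name, flags in results.items():
    print(name, flags)
```

In this sketch the two flags agree for each design: the first two designs are maximal before and after doubling, while the $2^{6-2}$ design is maximal in neither case.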

The following two key results essentially determine the structures of regular
resolution IV designs with $N/4 + 1 \le n \le N/2$.


THEOREM 3.4 ([2, 8]). Every maximal regular design of resolution IV with
$N/4 + 2 \le n \le N/2$ can be obtained by doubling a maximal regular design of
resolution IV or higher.


THEOREM 3.5 ([3]). For each $N = 2^k$ with $k \ge 4$ there exists at least one
maximal regular design of resolution IV or higher with $n = N/4 + 1$.


One family of maximal designs with $n = N/4 + 1$ can be found in Tang, Ma,
Ingram and Wang's [13] study of designs with maximum number of clear two-factor
interactions.

A complete search shows that there are two maximal regular 16-run designs of
resolution IV or higher. One is the saturated $2^{8-4}$ design of resolution IV, and the
other is the $2^{5-1}$ design defined by $I = ABCDE$, whose existence is assured by
Theorem 3.5. Repeatedly doubling the former yields all larger saturated regular
designs of resolution IV, while successively doubling the latter (a design of
resolution V) leads to a family of maximal designs of resolution IV with $n = 5N/16$, for
$N = 2^k$, $k \ge 5$. Since there are no 16-run maximal regular designs of resolution IV
with $5 < n < 8$, by Theorem 3.4, for all $N = 2^k$ with $k \ge 5$, there are no maximal
designs of resolution IV with $5N/16 < n < N/2$. This leads to the following
important conclusion.


COROLLARY 3.6. For $5N/16 < n < N/2$, a regular design of resolution IV
must be obtained by deleting columns from saturated designs of resolution IV. All
such designs are even designs.


Doubling the two maximal 16-run regular designs of resolution IV, we obtain
the saturated $2^{16-11}$ design of resolution IV and a $2^{10-5}$ design, both of which are
maximal. By Theorem 3.5 there exists at least one maximal regular 32-run design
of resolution IV with $n = 9$. In fact, there is exactly one such design. Therefore,
for $9 \le n \le 16$ there are exactly three 32-run maximal designs of resolution IV:
a $2^{16-11}$, a $2^{10-5}$ and a $2^{9-4}$. The last one and its repeated doubles constitute a
family of maximal regular designs of resolution IV with $n = 9N/32$, for $N = 2^k$,
$k \ge 5$. Again, there are no maximal regular designs of resolution IV with
$9N/32 < n < 5N/16$. Thus, for the $n$'s in this range, a regular design of resolution IV must
be obtained by deleting columns from either the saturated design of resolution IV
or the maximal regular design of resolution IV with $n = 5N/16$.

Doubling the three maximal 32-run designs of resolution IV, we obtain the
saturated $2^{32-26}$ design of resolution IV, a $2^{20-14}$ design and a $2^{18-12}$ design, all
of which are maximal. By Theorem 3.5 there exists at least one maximal regular
64-run design of resolution IV with $n = 17$. Block and Mee's [1] complete search
shows that there are five such designs. Therefore, for $17 \le n \le 32$ there are eight
64-run maximal regular designs of resolution IV: a $2^{32-26}$, a $2^{20-14}$, a $2^{18-12}$ and
five $2^{17-11}$. One can also conclude that there are exactly five maximal regular
designs of resolution IV with $n = 17N/64$, for all $N = 2^k$ with $k \ge 6$.

Now it is clear that if $N = 2^k$, $k \ge 4$, then for $n \ge N/4 + 1$ a maximal regular
design of resolution IV or higher must have


(3.1)    $n \in \{N/2,\ 5N/16,\ 9N/32,\ 17N/64,\ 33N/128,\ \ldots\}$.


Conversely, for each integer $n = (2^i + 1)N/2^{i+2}$, there exists at least one maximal
regular $N$-run design of resolution IV or higher with $n$ factors. A maximal regular
design of resolution IV or higher with $n = (2^i + 1)N/2^{i+2}$ and $N = 2^k$, $k \ge i + 2$,
can be constructed by repeatedly doubling a maximal regular $2^{i+2}$-run design of
resolution IV or higher with $2^i + 1$ factors.
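
As a small arithmetic illustration of (3.1), the following sketch lists the admissible numbers of factors $n \ge N/4 + 1$ for a given run size. The function name `maximal_factor_counts` is hypothetical, and the index range $i = 2, \ldots, k - 2$ is an assumption chosen so that the list ends at $n = N/4 + 1$.

```python
def maximal_factor_counts(N):
    """Values of n in (3.1) with n >= N/4 + 1, for N = 2^k and k >= 4."""
    k = N.bit_length() - 1
    if N != 1 << k or k < 4:
        raise ValueError("N must be a power of two with N >= 16")
    # n = N/2 is the saturated case; the rest are (2^i + 1) N / 2^(i+2), i = 2..k-2.
    return [N // 2] + [(2**i + 1) * N // 2 ** (i + 2) for i in range(2, k - 1)]

for N in (16, 32, 64):
    print(N, maximal_factor_counts(N))
```

For $N = 16$, $32$ and $64$ this reproduces the factor counts of the maximal designs listed above: 8 and 5; 16, 10 and 9; and 32, 20, 18 and 17.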


4. Some results on minimum aberration designs. By the discussion in 
the previous section, there are maximal regular designs of resolution IV with 
$n = N/2,\ 5N/16,\ 9N/32,\ 17N/64,\ 33N/128, \ldots$. Those with $n = N/2$, the saturated
regular designs of resolution IV, are known to have minimum aberration. For
$5N/16 < n < N/2$, minimum aberration designs (in fact, all regular designs of
resolution IV) must be projections of the saturated regular design of resolution IV. 
Butler [4] addressed the issue of deleting factors from the saturated regular design 
of resolution IV so that the resulting design has minimum aberration. This is rem- 
iniscent of the complementary design theory of Chen and Hedayat [6], Tang and 
Wu [14] and Suen, Chen and Wu [12] that deals with how to find a set of factors so 
that its complement in the saturated regular design of resolution III has minimum 
aberration. 

In an unpublished work, N. A. Butler found the minimum aberration design
with $n = 5N/16$ which, using the terminology in this paper, is the maximal regular
design of resolution IV with $n = 5N/16$. In this section we shall show that,
for $9N/32 \le n \le 5N/16$, the minimum aberration designs are projections of the
maximal regular design of resolution IV with $n = 5N/16$. Thus, although for
$n \ge 9N/32$ minimum aberration designs are projections of the maximal regular
design of resolution IV with either $N/2$ or $5N/16$ factors, the first two in (3.1),
the pattern breaks down at $9N/32$. Even the maximal design of resolution IV with
$n = 9N/32$ itself does not have minimum aberration.

Before proceeding to the proof of this result, we shall explore a bit more the
alias pattern of two-factor interactions under the maximal design of resolution IV
with $n = 5N/16$. Suppose $N = 16 \cdot 2^t$, where $t \ge 1$, and let $X^*$ be the maximal
design of resolution IV with $5 \cdot 2^t$ factors. Then $X^*$ can be obtained by doubling the
$2^{5-1}$ design defined by $I = ABCDE$ $t$ times. Suppose $a$, $b$, $c$, $d$ and $e$ are the five
columns of this $2^{5-1}$ design corresponding to factors $A$, $B$, $C$, $D$ and $E$,
respectively. Then each of $a$, $b$, $c$, $d$ and $e$ generates $2^t$ columns of $X^*$. For example, each
of the $2^t$ columns of $X^*$ generated by $a$ takes the form $x_1 \otimes \cdots \otimes x_t \otimes a$, where
$x_l = [1, 1]^T$ or $[1, -1]^T$. We shall denote the corresponding factor of $X^*$ by $A^{\mathbf{j}}$,
where $\mathbf{j} = (j_1, \ldots, j_t)$, with $j_l = 1$ if $x_l = [1, 1]^T$ and $j_l = -1$ if $x_l = [1, -1]^T$.
Notation such as $B^{\mathbf{j}}$, $C^{\mathbf{j}}$, $D^{\mathbf{j}}$ and $E^{\mathbf{j}}$ is similarly defined. The $5 \cdot 2^t$ factors of $X^*$
are thus partitioned into five groups each of size $2^t$.

Any two of the five factors $A$, $B$, $C$, $D$ and $E$, say $X$ and $Y$, generate $2^t \cdot 2^t$
two-factor interactions of the form $X^{\mathbf{i}}Y^{\mathbf{j}}$, where $\mathbf{i}$ and $\mathbf{j}$ are $1 \times t$ vectors with entries
1 or $-1$. These interactions, $\binom{5}{2} \cdot 2^t \cdot 2^t = 10 \cdot 2^t \cdot 2^t$ in total, are called between-group
two-factor interactions. Those of the form $X^{\mathbf{i}}X^{\mathbf{j}}$ with $\mathbf{i} \neq \mathbf{j}$, $5\binom{2^t}{2} = 5 \cdot 2^{t-1}(2^t - 1)$
in total, are called within-group two-factor interactions.

Since the $2^{5-1}$ design defined by $I = ABCDE$ is of resolution V, there is exactly
one two-factor interaction in each of its ten alias sets not containing main effects.
By applying Theorem 2.2 $t$ times, we see that, under $X^*$, there are $10 \cdot 2^t + (2^t - 1)$
alias sets not containing main effects, $10 \cdot 2^t$ of which each contain $2^t$ two-factor
interactions, and $2^t - 1$ of which each contain $5 \cdot 2^{t-1}$ two-factor interactions.

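It can be seen that the $2^t \cdot 2^t$ between-group two-factor interactions arising from
the same pair $(X, Y)$ are distributed evenly in $2^t$ alias sets of size $2^t$. Each of these
$2^t$ alias sets is of the form $\{X^{\mathbf{i}}Y^{\mathbf{j}} : \mathbf{i} \odot \mathbf{j} = \mathbf{k}\}$, where $\mathbf{k}$ is a $1 \times t$ vector with entries
1 or $-1$ and $\mathbf{i} \odot \mathbf{j}$ is the componentwise product of $\mathbf{i}$ and $\mathbf{j}$. We denote this alias
set by $XY_{\mathbf{k}}$. The ten possible pairs of $X$ and $Y$ account for the $10 \cdot 2^t$ alias sets of
size $2^t$ mentioned in the previous paragraph. On the other hand, the $2^{t-1}(2^t - 1)$
within-group two-factor interactions $X^{\mathbf{i}}X^{\mathbf{j}}$ arising from the same $X$ are distributed
evenly in the remaining $2^t - 1$ alias sets. Each of these alias sets consists of the
$5 \cdot 2^{t-1}$ interactions $A^{\mathbf{i}}A^{\mathbf{j}}$, $B^{\mathbf{i}}B^{\mathbf{j}}$, $C^{\mathbf{i}}C^{\mathbf{j}}$, $D^{\mathbf{i}}D^{\mathbf{j}}$, $E^{\mathbf{i}}E^{\mathbf{j}}$ with $\mathbf{i} \odot \mathbf{j} = \mathbf{k}$. (Note that $X^{\mathbf{i}}X^{\mathbf{j}}$
is the same as $X^{\mathbf{j}}X^{\mathbf{i}}$ and $\mathbf{k} \neq \mathbf{1}$, where $\mathbf{1}$ is the vector of 1's.) We denote this alias
set by $W_{\mathbf{k}}$.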

One key property that is important for the proof is that each of the $2^t$ factors of
the form $X^{\mathbf{i}}$ appears in exactly one of the $2^t$ two-factor interactions in each alias
set $XY_{\mathbf{k}}$. As a consequence, if $u$ factors generated by $X$ are deleted from $X^*$, then
the number of two-factor interactions in each of these $2^t$ alias sets is reduced by $u$.
In this case, since each factor of the $2^{5-1}$ design can be coupled with four other
factors, the number of two-factor interactions in $4 \cdot 2^t$ of the $10 \cdot 2^t$ alias sets of size
$2^t$ is reduced to $2^t - u$; that in each of the other $6 \cdot 2^t$ alias sets remains $2^t$.

It can also be seen that each of the $5 \cdot 2^t$ factors appears in exactly one
within-group two-factor interaction in each $W_{\mathbf{k}}$.
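
The alias-set counts described above can be confirmed by enumeration for small $t$. The sketch below is an illustration under assumptions not in the paper: factors are coded as binary vectors (integers), with $E = ABCD$ coded as 15, each doubling adds one new bit, and two interactions are aliased exactly when the XORs of their factor codes coincide. The names `x_star` and `alias_set_size_profile` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def x_star(t):
    """Factor codes of the design obtained by doubling the 2^{5-1} design t times."""
    factors = [1, 2, 4, 8, 15]          # A, B, C, D and E = ABCD
    for step in range(t):
        new_bit = 1 << (4 + step)
        factors = factors + [v | new_bit for v in factors]
    return factors

def alias_set_size_profile(factors):
    """Map alias-set size -> number of alias sets of that size (two-factor interactions only)."""
    sizes = Counter(u ^ v for u, v in combinations(factors, 2))
    return dict(Counter(sizes.values()))

profiles = {t: alias_set_size_profile(x_star(t)) for t in (1, 2, 3)}
for t, profile in profiles.items():
    print(t, profile)
```

For each $t$ tried, the profile consists of $10 \cdot 2^t$ alias sets of size $2^t$ and $2^t - 1$ alias sets of size $5 \cdot 2^{t-1}$.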

Now we are ready to prove the following theorem. 


THEOREM 4.1. For any $N = 2^k$, $k \ge 5$, and $9N/32 \le n \le 5N/16$, the minimum
aberration design must be a projection of the design constructed by repeatedly
doubling the $2^{5-1}$ design defined by $I = ABCDE$.


PROOF. Let $N = 16 \cdot 2^t$ and $u = 5N/16 - n$. Then since $5N/16 - 9N/32 = 2^{t-1}$,
we have $0 \le u \le 2^{t-1}$.

The minimum aberration design with $N = 32$ and $n = 9$ is known to have at
least one zero among the $m_i$'s, $1 \le i \le f$ ([7], page 91), and therefore is not
maximal. In fact, it is obtained by deleting one factor from the maximal design with
ten factors. By repeatedly applying Corollary 2.4, we see that, for each $N = 2^k$
with $k \ge 5$, the maximal regular design of resolution IV with $9N/32$ factors does
not have minimum aberration. Now since there are only two maximal regular
designs of resolution IV with more than $9N/32$ factors, one with $N/2$ factors and
the other with $5N/16$ factors, it is enough to show that, for $N = 16 \cdot 2^t$ with $t \ge 2$,
there is at least one $(5N/16 - u)$-factor projection of the maximal regular design
of resolution IV with $5N/16$ factors that has less aberration than all projections
of the saturated regular design of resolution IV. As before, let $X^*$ be the maximal
regular design of resolution IV with $5N/16$ factors.

Now consider a design $X$ obtained by deleting from $X^*$ $u$ factors that are
generated by the same factor $A$ of the $2^{5-1}$ design defined by $I = ABCDE$. We shall
show that $X$ has fewer defining words of length four or, equivalently, by (2.2),
a smaller value of $\sum m_i^2$ than all projections of the saturated regular design of
resolution IV.

By the observations preceding the statement of the current theorem, under $X$,
$4 \cdot 2^t$ of the $10 \cdot 2^t$ alias sets of between-group two-factor interactions have $m_i$
equal to $2^t - u$, and the other $6 \cdot 2^t$ alias sets have $m_i$ equal to $2^t$. Now we provide
an upper bound on the sum of squares of the $m_i$ values over the alias sets of
within-group two-factor interactions.

Under $X^*$, in each of the $2^t - 1$ alias sets of within-group two-factor interactions,
$2^{t-1}$ of the $5 \cdot 2^{t-1}$ two-factor interactions involve factors generated by $A$. Thus,
when only factors generated by $A$ are deleted, for each of these alias sets, the
resulting $m_i$ satisfies $4 \cdot 2^{t-1} \le m_i \le 5 \cdot 2^{t-1}$. Consequently, an upper bound on
the sum of squares of the $m_i$ values over these alias sets can be obtained by making
as many of the $m_i$ values equal to $5 \cdot 2^{t-1}$ or $4 \cdot 2^{t-1}$ as possible.

When $u$ factors generated by $A$ are deleted, $\binom{u}{2} + u(2^t - u)$ within-group
two-factor interactions are also deleted. Write

$$u(u + 1)/2 = a \cdot 2^{t-1} + b,$$

where $a$ and $b$ are nonnegative integers such that $b < 2^{t-1}$. Then

(4.1)    $$\binom{u}{2} + u(2^t - u) = (2u - a)2^{t-1} - b.$$

By the observation two paragraphs above, an upper bound on the sum of squares of
the $m_i$ values over the $2^t - 1$ alias sets of within-group two-factor interactions can
be obtained by assuming that $2u - a - 1$ of the $m_i$ values are equal to $4 \cdot 2^{t-1}$, one is
equal to $4 \cdot 2^{t-1} + b$, and the remaining $2^t - 2u + a - 1$ values are equal to $5 \cdot 2^{t-1}$.
Combining this with the $m_i$ values for the alias sets of between-group interactions
obtained earlier, we conclude that if $u$ factors generated by $A$ are deleted from $X^*$,
then

(4.2)    $$\sum_{i=1}^{f} m_i^2 \le 4 \cdot 2^t (2^t - u)^2 + 6 \cdot 2^t (2^t)^2 + (4 \cdot 2^{t-1} + b)^2 + (2u - a - 1)(4 \cdot 2^{t-1})^2 + (2^t - 2u + a - 1)(5 \cdot 2^{t-1})^2.$$


On the other hand, a saturated regular design of resolution IV has all the $\binom{2^{t+3}}{2}$
two-factor interactions in $N/2 - 1 = 2^{t+3} - 1$ alias sets. Thus, a design obtained
by deleting factors from a saturated regular design of resolution IV can have at
most $2^{t+3} - 1$ nonzero $m_i$'s. For such a design,

(4.3)    $$\sum_{i=1}^{f} m_i^2 \ge \binom{5 \cdot 2^t - u}{2}^2 \Big/ (2^{t+3} - 1).$$

It is sufficient to show that the right side of (4.2) is less than that of (4.3). This
can be verified by tedious calculations using (4.1) and the assumption that
$0 \le b \le 2^{t-1} - 1$. The details are omitted. □
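
Although the algebra is omitted, the comparison of the two bounds is easy to check numerically for small $t$. The sketch below is not a proof. It assumes that the right side of (4.2) is $4 \cdot 2^t(2^t - u)^2 + 6 \cdot 2^t(2^t)^2 + (4 \cdot 2^{t-1} + b)^2 + (2u - a - 1)(4 \cdot 2^{t-1})^2 + (2^t - 2u + a - 1)(5 \cdot 2^{t-1})^2$ and that the right side of (4.3) is $\binom{5 \cdot 2^t - u}{2}^2 / (2^{t+3} - 1)$, as the surrounding text indicates. The function names are hypothetical, and the range $t = 1, \ldots, 8$ is an arbitrary choice.

```python
from math import comb

def upper_bound_42(t, u):
    """Right side of (4.2) for N = 16 * 2^t when u factors generated by A are deleted."""
    half = 2 ** (t - 1)
    a, b = divmod(u * (u + 1) // 2, half)   # u(u+1)/2 = a*2^(t-1) + b, 0 <= b < 2^(t-1)
    between = 4 * 2**t * (2**t - u) ** 2 + 6 * 2**t * (2**t) ** 2
    within = ((4 * half + b) ** 2
              + (2 * u - a - 1) * (4 * half) ** 2
              + (2**t - 2 * u + a - 1) * (5 * half) ** 2)
    return between + within

def lower_bound_43(t, u):
    """Right side of (4.3) as an exact (numerator, denominator) pair of integers."""
    return comb(5 * 2**t - u, 2) ** 2, 2 ** (t + 3) - 1

def inequality_holds(t, u):
    """True if the right side of (4.2) is strictly less than that of (4.3)."""
    num, den = lower_bound_43(t, u)
    return upper_bound_42(t, u) * den < num

checked = all(inequality_holds(t, u)
              for t in range(1, 9) for u in range(2 ** (t - 1) + 1))
print(checked)
```

Integer cross-multiplication is used instead of floating-point division so that the strict inequality is tested exactly over the whole range $0 \le u \le 2^{t-1}$.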


Note that the case u = 0 provides an alternative proof of Butler’s result on the 
optimality of maximal designs of resolution IV with n = 5N/16. 

Theorem 4.1 leaves open the issue of which projection of the maximal design of 
resolution IV with n = 5N/16 has minimum aberration. A complementary design 
theory in the same spirit as that for saturated regular designs of resolution III and 
IV needs to be developed. 

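Suppose $u$ factors are deleted from $X^*$. For $X = A, B, C, D, E$, let $n_X$ be the
number of factors deleted from those generated by $X$. Among the $n_X n_Y$ pairs $(\mathbf{i}, \mathbf{j})$
where $X^{\mathbf{i}}$ and $Y^{\mathbf{j}}$ are deleted, let $n^{\mathbf{k}}_{XY}$ be the total number such that $\mathbf{i} \odot \mathbf{j} = \mathbf{k}$; this is
the number of between-group two-factor interactions formed by the deleted $X^{\mathbf{i}}$'s and
$Y^{\mathbf{j}}$'s that belong to the alias set $XY_{\mathbf{k}}$. Similarly, among the $\binom{n_X}{2}$ unordered pairs $(\mathbf{i}, \mathbf{j})$
where $X^{\mathbf{i}}$ and $X^{\mathbf{j}}$ are deleted, let $n^{\mathbf{k}}_{XX}$ be the total number such that $\mathbf{i} \odot \mathbf{j} = \mathbf{k}$; this
is the number of within-group two-factor interactions formed by the deleted $X^{\mathbf{i}}$'s
that belong to the alias set $W_{\mathbf{k}}$. Then the number of two-factor interactions in $XY_{\mathbf{k}}$
is reduced by $n_X + n_Y - n^{\mathbf{k}}_{XY}$, and the number of two-factor interactions in $W_{\mathbf{k}}$ is
reduced by $n_A + n_B + n_C + n_D + n_E - n^{\mathbf{k}}_{AA} - n^{\mathbf{k}}_{BB} - n^{\mathbf{k}}_{CC} - n^{\mathbf{k}}_{DD} - n^{\mathbf{k}}_{EE} =
u - n^{\mathbf{k}}_{AA} - n^{\mathbf{k}}_{BB} - n^{\mathbf{k}}_{CC} - n^{\mathbf{k}}_{DD} - n^{\mathbf{k}}_{EE}$. Thus, a minimum aberration projection
of $X^*$ must minimize

$$\sum_{X,Y} \sum_{\mathbf{k}} \left(2^t - n_X - n_Y + n^{\mathbf{k}}_{XY}\right)^2
+ \sum_{\mathbf{k} \neq \mathbf{1}} \left(5 \cdot 2^{t-1} - u + n^{\mathbf{k}}_{AA} + n^{\mathbf{k}}_{BB} + n^{\mathbf{k}}_{CC} + n^{\mathbf{k}}_{DD} + n^{\mathbf{k}}_{EE}\right)^2,$$

where, in the first term, the first sum is over the ten possible $(X, Y)$'s and the
second sum is over all the $1 \times t$ vectors of 1's and $-1$'s. While this can be solved
without difficulty for small $u$'s, more general results need to be developed. We
expect an optimal strategy to delete the factors one at a time alternately from the
five groups of factors generated by $A$, $B$, $C$, $D$, $E$, while making the two-factor
interactions formed by the deleted factors as uniformly distributed among the alias
sets as possible.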


Acknowledgments. We would like to thank the referees for their helpful com- 
ments. 


REFERENCES

[1] BLOCK, R. M. and MEE, R. W. (2003). Second order saturated resolution IV designs. J. Statist. Theory Appl. 2 96-112. MR2040435

[2] BRUEN, A., HADDAD, L. and WEHLAU, D. (1998). Binary codes and caps. J. Combin. Des. 6 275-284. MR1623650

[3] BRUEN, A. and WEHLAU, D. (1999). Long binary linear codes and large caps in projective space. Des. Codes Cryptogr. 17 37-60. MR1714367

[4] BUTLER, N. A. (2003). Some theory for constructing minimum aberration fractional factorial designs. Biometrika 90 233-238. MR1966563

[5] CHEN, H. and CHENG, C.-S. (2004). Aberration, estimation capacity and estimation index. Statist. Sinica 14 203-215. MR2036768

[6] CHEN, H. and HEDAYAT, A. S. (1996). $2^{n-l}$ designs with weak minimum aberration. Ann. Statist. 24 2536-2548. MR1425966

[7] CHENG, C.-S., STEINBERG, D. M. and SUN, D. X. (1999). Minimum aberration and model robustness for two-level fractional factorial designs. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 85-93. MR1664104

[8] DAVYDOV, A. A. and TOMBAK, L. M. (1990). Quasiperfect linear binary codes with distance 4 and complete caps in projective geometry. Problems Inform. Transmission 25 265-275. MR1040020

[9] FRIES, A. and HUNTER, W. G. (1980). Minimum aberration $2^{k-p}$ designs. Technometrics 22 601-608. MR0596803

[10] PLACKETT, R. L. and BURMAN, J. P. (1946). The design of optimum multi-factorial experiments. Biometrika 33 305-325. MR0016624

[11] RAKTOE, B. L., HEDAYAT, A. S. and FEDERER, W. T. (1981). Factorial Designs. Wiley, New York. MR0633756

[12] SUEN, C.-Y., CHEN, H. and WU, C. F. J. (1997). Some identities on $q^{n-m}$ designs with application to minimum aberration designs. Ann. Statist. 25 1176-1188. MR1447746

[13] TANG, B., MA, F., INGRAM, D. and WANG, H. (2002). Bounds on the maximum number of clear two-factor interactions for $2^{m-p}$ designs of resolution III and IV. Canad. J. Statist. 30 127-136. MR1907681

[14] TANG, B. and WU, C. F. J. (1996). Characterization of minimum aberration $2^{n-k}$ designs in terms of their complementary designs. Ann. Statist. 24 2549-2559. MR1425967

DIVISION OF BIOSTATISTICS AND BIOINFORMATICS
DEPARTMENT OF EPIDEMIOLOGY AND PREVENTIVE MEDICINE
UNIVERSITY OF MARYLAND SCHOOL OF MEDICINE
113 HOWARD HALL
660 WEST REDWOOD STREET
BALTIMORE, MARYLAND 21201
USA
E-MAIL: hchen@epi.umaryland.edu

DEPARTMENT OF STATISTICS
UNIVERSITY OF CALIFORNIA
BERKELEY, CALIFORNIA 94720
USA
E-MAIL: cheng@stat.berkeley.edu


The Annals of Statistics
Vol. 34 April 2006 No. 2


Boosting and Thresholding

Boosting for high-dimensional linear models
    PETER BÜHLMANN
Adapting to unknown sparsity by controlling the false discovery rate
    FELIX ABRAMOVICH, YOAV BENJAMINI, DAVID L. DONOHO AND IAIN M. JOHNSTONE

Semiparametric and Nonparametric Methods

Inference for covariate adjusted regression via varying coefficient models
    DAMLA ŞENTÜRK AND HANS-GEORG MÜLLER
Adaptive goodness-of-fit tests in a density model
    MAGALIE FROMONT AND BÉATRICE LAURENT
Tailor-made tests for goodness of fit to semiparametric hypotheses
    PETER J. BICKEL, YA'ACOV RITOV AND THOMAS M. STOKER
The behavior of the NPMLE of a decreasing density near the boundaries of the support
    VLADIMIR N. KULIKOV AND HENDRIK P. LOPUHAÄ

Bayesian Analysis

Frequentist optimality of Bayesian wavelet shrinkage rules for Gaussian and non-Gaussian noise
    MARIANNA PENSKY
Shrinkage priors for Bayesian prediction
    FUMIYASU KOMAKI
A Bayes method for a monotone hazard rate via S-paths
    MAN-WAI HO
Misspecification in infinite-dimensional Bayesian statistics
    B. J. K. KLEIJN AND A. W. VAN DER VAART

Iterative Proportional Fitting

An iterative procedure for general probability measures to obtain I-projections onto intersections of convex sets
    BHASKAR BHATTACHARYA

Survival Analysis

Asymptotic theory for the Cox model with missing time-dependent covariate
    JEAN-FRANÇOIS DUPUY, ION GRAMA AND MOUNIR MESBAH
Product-limit estimators of the survival function with twice censored data
    VALENTIN PATILEA AND JEAN-MARIE ROLIN

Graphical Model Theory

Characterizing Markov equivalence classes for AMP chain graph models
    STEEN A. ANDERSSON AND MICHAEL D. PERLMAN

Topics in Time Series

Explicit representation of finite predictor coefficients and its applications
    AKIHIKO INOUE AND YUKIO KASAHARA
Fitting an error distribution in some heteroscedastic time series models
    HIRA L. KOUL AND SHIQING LING
Strong invariance principles for sequential Bahadur-Kiefer and Vervaat error processes of long-range dependent sequences
    MIKLÓS CSÖRGŐ, BARBARA SZYSZKOWICZ AND LIHONG WANG

Correction Note

Efficient parameter estimation for self-similar processes
    RAINER DAHLHAUS





LECTURE NOTES - MONOGRAPH SERIES

LNMS Volume 47:
Recent Developments in Multiple Comparison Procedures
Yoav Benjamini, Frank Bretz and Sanat Sarkar, Editors

Price US$35
IMS Member Price US$21

This volume is based on the NSF-CBMS Conference in Mathematical Sciences,
New Horizons in Multiple Comparison Procedures, held in August 2001 at Temple
University, Philadelphia, with Yosef Hochberg as the key speaker.

The conference was organized in response to the sudden upsurge of research
that has taken place in the area of multiple comparisons. A number of newer
ideas have emerged from the recent research activities that could generate a
steady stream of new research.

This volume is a collection of 11 papers, covering a broad range of topics in
Multiple Comparisons, which were invited for the conference. The goal of this
volume is to gain deeper understanding of these ideas and to promote further
research activities in this area.

Order online: http://www.imstat.org/
Or send payment (Mastercard/Visa/American Express/Discover, or check payable on a US bank in US funds):

Institute of Mathematical Statistics, Dues and Subscriptions Office
9650 Rockville Pike, Suite L2407A, Bethesda MD 20814-3998, USA

Tel: (301) 634-7029 Fax: (301) 634-7099 Email: staff@imstat.org


